# Handbook of Mathematical Geosciences

Fifty Years of IAMG

B. S. Daya Sagar · Qiuming Cheng · Frits Agterberg, Editors

*Editors*

B. S. Daya Sagar, Systems Science and Informatics Unit, Indian Statistical Institute–Bangalore Centre, Bengaluru, India

Qiuming Cheng, State Key Lab of Geological Processes and Mineral Resources, China University of Geosciences, Beijing, China

Frits Agterberg, Geological Survey of Canada, Ottawa, ON, Canada

ISBN 978-3-319-78998-9 ISBN 978-3-319-78999-6 (eBook) https://doi.org/10.1007/978-3-319-78999-6

Library of Congress Control Number: 2018937688

© The Editor(s) (if applicable) and The Author(s) 2018. This book is an open access publication.

**Open Access** This book is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this book are included in the book's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the book's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cover illustration: Presidents of the International Association for Mathematical Geosciences (IAMG). From Left to Right and Top to Bottom: First Row: IAMG Logo, William Christian Krumbein (First Past President), Andrei B. Vistelius (1968–1972), Richard A. Reyment (1972–1976), Daniel F. Merriam (1976–1980), Second Row: E. H. Timothy Whitten (1980–1984), John C. Davis (1984–1989), Richard B. McCammon (1989–1992), Michael Ed. Hohn (1992–1996), Ricardo A. Olea (1996–2000), Third Row: Graeme Bonham-Carter (2000–2004), Frits P. Agterberg (2004–2008), Vera Pawlowsky-Glahn (2008–2012), Qiuming Cheng (2012–2016), Jennifer McKinley (2016–2020).

Printed on acid-free paper

This Springer imprint is published by the registered company Springer International Publishing AG, part of Springer Nature

The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

*Dedicated to Daniel F. Merriam and Richard A. Reyment (Fathers of the IAMG)*

### **Foreword**

The International Association for Mathematical Geosciences (IAMG) was founded during the 23rd International Geological Congress in Prague, August 1968. Within the Earth Sciences, the IAMG has played a prominent role during the past 50 years by living up to its mandate to promote, worldwide, the advancement of mathematics, statistics, and informatics in the geosciences. Under its auspices there have been, and continue to be, important developments in applications of mathematics, statistics, and computer science in the Earth Sciences. To give two examples: IAMG members Georges Matheron and Jean Serra developed geostatistics and mathematical morphology, resulting in methods that are now widely applied in other branches of science and engineering; John Aitchison invented methods to circumvent the problem of spurious correlations that often arise in compositional data analysis of petrological and geochemical data. IAMG members later developed this topic further, and it is now used in other fields of science and in the social sciences as well. During the first 30 years of its existence, IAMG stood for International Association for Mathematical Geology; its current name was adopted to widen its scope and provide a home to scientists who are not only geologists but who perform research in other fields of science and engineering. From the beginning, prominent mathematical statisticians including John Tukey, Geoffrey Watson, and Franklin Graybill played an important part within the IAMG by providing advice and collaborating in research projects.

In addition to organizing or co-sponsoring international conferences, workshops, and lecture series, the IAMG established three successful scientific journals: Mathematical Geosciences, Computers & Geosciences, and Natural Resources Research (formerly: Nonrenewable Resources). In total, five types of IAMG awards were created to honor William Christian Krumbein (1902–1979), Andrew Borisovich Vistelius (1915–1995), John Cedric Griffiths (1912–1992), Felix Chayes (1916–1993), and Georges Matheron (1930–2000), who were pioneers in mathematical geology. The book in front of us, "Handbook of Mathematical Geosciences: Fifty Years of IAMG," published to celebrate the Golden Anniversary of the IAMG, contains 45 chapters prepared by IAMG award winners, founding members, and distinguished lecturers. It covers new theoretical developments, applications, reviews of subfields of the mathematical geosciences, and historical information on the IAMG, especially in its early years.

Bill Krumbein, as a geologist, first started using a digital computer in 1958, and gradually more mathematical geologists began working with digital computers in the 1960s. This involved the development of computer programs written in FORTRAN or ALGOL to use existing statistical techniques such as analysis of variance, multiple regression, multivariate statistical techniques, and time series analysis that had been developed during the first half of the twentieth century. Also, new methods including trend surface analysis and geostatistical ore reserve estimation techniques were developed specifically for solving geoscience problems. Dan Merriam established the "Kansas Geological Survey Computer Contributions." In this series, 50 computer programs were published between 1966 and 1970. During this time period, Dick Reyment worked closely with Dan to establish the IAMG.

Computers brought about further important changes that were rapidly adopted by mathematical geologists, including geographic information systems (GIS), exploratory data analysis, the fast Fourier transform, mathematical morphology, fractals, and nonlinear models. Even more recently, our world has entered the "Big Data" era, with data being produced at unprecedented speed and in large quantities. The new knowledge obtained through digital analysis and the novel methods of data mining are greatly benefitting human decision making. People's lives, work, and thinking are undergoing drastic changes. "Big Data" resulted in the emergence of "Data Science" which, to some extent, is affecting all fields of science, both in how scientific research is being conducted using digital data and by facilitating the use of scientific methods to study the digital data.

Nowadays, geosciences and geological research are mainly characterized by the following words: "Systematic," "Comprehensive," "Quantitative," "Three-dimensional," "New-model," "Green," "Intelligent," and "Beneficial to People." In this regard, Mathematical Geosciences and the IAMG play an increasingly important role, prompting the advancement of the geosciences in the future. Earth science and geological studies are data-intensive. If we want to solve geological problems and use the results in a meaningful way, we have to obtain and work with many different kinds of data obtained by using sound geological concepts and methods borrowed from physics, chemistry, and remote sensing. Geoscience experts in the latter fields of science make invaluable contributions to our understanding of the Earth and the geological processes that took place millions of years ago. In all these endeavors, mathematics plays a significant role. This is where the IAMG is exceedingly helpful. Geology is characterized by the four "Deeps": Its data and processes are deep in the Earth, deep under the sea, deep in outer space, and deep in time. It is not easy to obtain comprehensive geological data sets in practice. Data collection can be very expensive. Much attention is to be paid to costs and benefits.

Earth scientists should always do their best to define target populations from which truly representative samples are to be drawn. Geological samples almost never fully comprise the entire population of study because of differences in space and time. There is no "overall data completeness" or "comprehensive data" in geological science and practice. Other methods of data collection have to be developed and used in order to make the random samples fit the target populations as closely as possible so that information loss because of spatial restrictions is minimized.

The ultimate purpose of Earth Science is to promote progress and development of human society: The products of the Earth's evolution over millions of years are to be used to our advantage, and we have to guard against the negative effects of the different types of disasters that can be associated with geological processes. Geological data have particular characteristic features that reflect time and cause of origin, spatial environments, and genesis. They can manifest different outcomes reflecting spatial and temporal conditions. When faced with geological data, one should not only know the "What?" but also the "Why?" and the "How?" for the data: What they truly mean and how they are to be used. One should not only establish "correlations" but also "causality" and spatiotemporal relations. In this respect, geology differs from most other areas in the Big Data era, where the focus is on the "What?" only and on correlations without causality and the "Why?"

The laws of physics and chemistry have not changed through geologic time. This fact underlies the principle of actualism already understood by geologists in the nineteenth century. Some early geologists already surmised that the ice ages, whose effects can be clearly seen on the surface of the Earth, were caused by minor systematic fluctuations in the amount of radiation received from the sun. A full explanation of this periodicity was provided in the theory of Milankovitch. This theory is currently used to estimate ages of stage boundaries in the geologic timescale during the past 65 million years with a precision that is better than the precision provided by geochronological dating methods.

The age of the Earth is about 4.5 billion years, and it is in its middle age. Taking 90 years as the human life expectancy, for example, one year in our life is approximately equivalent to 50 million years in the past of the Earth: the factor of difference is about 4,500,000,000/90 = 50,000,000. The following examples illustrate the change of perspective needed to understand geological processes. Earthquakes with a magnitude greater than 8.0 on the Gutenberg–Richter scale occur about once a year. Consequently, about 50 million such earthquakes have probably occurred over the last 50 million years. The speed of tectonic plates is of the order of 1–10 cm/year. Thus, plates have moved 500–5000 km per 50 million years. This explains why oceans open and close over geologic time.
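The arithmetic behind these figures can be retraced in a few lines. The following is only an illustrative sketch: the variable and function names are hypothetical, and the inputs are the rounded values quoted above (an Earth age of 4.5 billion years, a 90-year human life, one great earthquake per year, and plate speeds of 1–10 cm/year).

```python
# Illustrative check of the time-scale arithmetic quoted in the text.
# All names are hypothetical; the inputs are the rounded figures from the text.

EARTH_AGE_YEARS = 4.5e9        # approximate age of the Earth
HUMAN_LIFESPAN_YEARS = 90      # assumed human life expectancy
CM_PER_KM = 100_000


def plate_displacement_km(speed_cm_per_year: float, duration_years: float) -> float:
    """Distance covered at a constant plate speed, converted from cm to km."""
    return speed_cm_per_year * duration_years / CM_PER_KM


# Earth years that correspond to one year of a 90-year human life.
scale_factor = EARTH_AGE_YEARS / HUMAN_LIFESPAN_YEARS

# One magnitude > 8.0 earthquake per year, sustained over that many years.
great_earthquakes = 1 * scale_factor

# Plate motion at the slow and fast ends of the 1-10 cm/year range.
slow_km = plate_displacement_km(1, scale_factor)
fast_km = plate_displacement_km(10, scale_factor)

print(f"scale factor: {scale_factor:,.0f} Earth years per human year")
print(f"great earthquakes in that span: {great_earthquakes:,.0f}")
print(f"plate displacement: {slow_km:,.0f} to {fast_km:,.0f} km")
```

The constant-speed assumption is of course a simplification; the point is only the order of magnitude.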

Early in the nineteenth century, it became known that most coal deposits originated during the Carboniferous. More recently, Earth scientists have developed theories about the genesis of ore and hydrocarbon deposits that help to make new discoveries. Recognition of the importance of bio-factors has aided in the understanding of various geological processes, including ore and hydrocarbon formation, as well as the distribution of pollutants in the ecosystem. Increasingly, mathematics and statistics are fruitfully employed in the discovery process, as abundantly exemplified in many of the chapters in this Handbook. All of the preceding considerations illustrate the complexity and particularities of geological data as well as their usefulness and importance. Fully comprehensive geological data collection, their effective computer-based treatment, rational analysis, and translation into digital knowledge all depend on the guidance provided by powerful theory based on mathematics with applications of efficient methods.

Initially, most IAMG members were located within the USA or Europe. These regions continue to have relatively many members, but China and other Asian countries now also constitute a large regional group. In 1990, a workshop was organized at the China University of Geosciences in Wuhan, at which the participants included Richard McCammon, IAMG President at the time, as well as four future IAMG Presidents. Now, the IAMG's China Section holds annual meetings attended by hundreds of mathematical geoscientists. It has increasingly been felt that mathematical geoscience is making an indispensable contribution in China to the prediction of occurrences of mineral resources, especially in non-traditional regions such as the deep Earth and covered regions, and to the assessment of hazards such as earthquakes and landslides. As society develops from its industrialization to its post-industrialization stage, environmental and ecological applications become increasingly important to establish and reduce the effects of regional patterns of pollution. Other anticipated areas of application are urban space utilization and agricultural products under the new concepts of green and low-carbon development.

Pengda Zhao, Academician of the Chinese Academy of Sciences, China University of Geosciences, Beijing, China

Frits Agterberg, Geological Survey of Canada, Ottawa, Canada

February 2018

### **Preface**

The International Association for Mathematical Geosciences (IAMG) was formed in 1968, and the year 2018 marks its Golden Anniversary. The "Handbook of Mathematical Geosciences: Fifty Years of IAMG," released during the IAMG Conference held at Olomouc and Prague (Czech Republic), September 2–8, 2018, motivates readers, including professional geomathematicians and undergraduate and postgraduate students, to learn about the fifty years of contributions by award-winning mathematical geoscientists. This book, which showcases the success of the IAMG as it celebrates its fifty years of existence, is a compilation of 45 chapters. The chapters were written by academics, scientists, and engineers who are recipients of IAMG accolades such as the Krumbein Medal, Chayes Prize, Vistelius Award, Griffiths Award, Matheron Lectureship, Distinguished Lectureship, and Honorary Membership, as well as by IAMG Founding Members. They cover topics such as mathematical geosciences, mathematical morphology, geostatistics, fractals and multifractals, spatial statistics, multipoint geostatistics, compositional data analysis, informatics, geocomputation, numerical methods, and chaos theory in the geosciences, categorized broadly into theory, general applications, exploration and resource estimation, reviews, and reminiscences. Unique features of this book include the following:


The first ten chapters are categorized as theoretical, followed by seven chapters (from 11 to 17) in the general applications part. Chapters 18–26 and 27–35 are, respectively, categorized as exploration and resource estimation, and reviews. The last ten chapters (from 36 to 45) are categorized as reminiscences. What follows is a brief summary of each of the chapters of the Handbook.

Chapter 1 by Dubrule reviews the relationships between Bayesian methods, geostatistics, and ensemble Kalman filtering. The author rightly mentions that (i) inversion techniques are not discussed and (ii) fast-growing machine learning algorithms are challenging the geostatistical and Bayesian formalisms.

In Chap. 2, Baddeley compares and contrasts various statistical methods for predicting the occurrence of mineral deposits, such as logistic regression, Poisson point process models, maximum entropy, monotone regression, nonparametric survey estimates, recursive partitioning, and receiver operating characteristic curves.

Chapter 3 by Schaeben is concerned with testing joint conditional independence of categorical random variables with a newly proposed standard likelihood ratio test. How it resolves limitations obvious with "omnibus" and "new omnibus" tests is explained with a strong theoretical basis invoking the Hammersley–Clifford theorem.

The sample space approach for modeling compositional data is explained in Chap. 4 by Egozcue and Pawlowsky-Glahn. Interestingly, perturbation between elements and its opposite, i.e., difference perturbation, appear to be Matheron–Serra's morphological dilations and erosions or Minkowski additions and subtractions. Repeated perturbations and their inverted versions (difference perturbations) seem to be multiscale morphological dilations and erosions.

Possible methods required to refocus and streamline expert geological judgment inputs along with analytical methods are reviewed by Kaufman in Chap. 5.

Acquisition of remotely sensed satellite data via various sensing mechanisms poses challenges, particularly in developing filters for feature extraction or retrieval. Many filters yield promising results but cannot be generalized because of the varied complexities of the sensing mechanisms that produce different types of satellite images. For instance, filters that work well for satellite images acquired via optical sensing mechanisms would not yield appropriate results for images acquired via microwave sensing mechanisms. Besides, satellite images are now available with a large number of channels at high spatial/temporal/spectral resolutions, making it more challenging to map features with a high degree of precision. Because the available filters cannot be generalized across images acquired by different mechanisms, there is a need for filters with a strong theoretical basis. Cressie contributes rich content in Chap. 6, and the ideas provided in this chapter are of fundamental importance.

Deutsch in Chap. 7 provides logical and powerful arguments for why the ensemble of realizations, rather than one single realization, needs to be considered for proper planning, decision making, and uncertainty assessment.

How criteria and arguments have been employed over the past forty years in comparing binary coefficients in multivariate statistical analysis is reviewed in Chap. 8 by Hohn.

Armstrong, Mondaini, and Camargo provide a sociological study based on Google retrievals in Chap. 9. The chapter studies how research in the geosciences diffuses within academia and into industry, taking as its example plurigaussian simulation, a method invented in France. This study is in some ways related to "scientometrics." An obvious choice for carrying out this type of study is complex-network-based analysis, such as small-world network analysis (due to Duncan Watts and Steven Strogatz). Such ideas in social network analysis were predominantly developed by Barabási and his group.

In the first part of Chap. 10, Cheng gives an excellent chronological overview of how mathematical geosciences, or geomathematics, evolved in the last fifty years, providing (i) historical connections between mathematics and the geosciences, and (ii) a new definition of mathematical geosciences. The second part of the chapter introduces fractal density, singularity analysis, and related subjects for solving geological problems, and discusses geological principles with case studies related to earthquakes. Cheng demonstrates the application of his original concept of fractal density and local singularity to model the clustering frequency of earthquakes in the Pacific subduction zones. A much stronger singularity is found in the clustering frequency of earthquakes along the colder and older western boundaries of the Pacific plates than along the hotter and younger eastern boundaries.

The use of electrofacies in reservoir characterization is demonstrated on a giant clastic oil reservoir, the Amal field of Libya, in Chap. 11 by Davis.

In Chap. 12, morphological medians and weighted morphological medians are employed by Serra in a new, elegant approach demonstrated on shoreline extrapolations. Quench stripe generation, based on these two novel types of medians, provides the main basis for predicting the locations of the shorelines.

A comprehensive review of geostatistical methods to analyze remote-sensing data is presented in Chap. 13 by Militino, Ugarte, and Pérez-Goya. This review highlights the importance of geostatistics in processing and analysis of remotely sensed satellite data available in multiple spatial/temporal/spectral resolutions acquired via a host of different sensing mechanisms.

Chapter 14 by Goovaerts contains an interesting first application of space–time geostatistics to assess lead levels recorded in the drinking water of the public distribution system in Flint, Michigan.

Statistical Parametric Mapping (SPM), popular in medical imaging science for evaluating differences between individual pairs of images or average images, is reviewed in Chap. 15 by McKenna with examples drawn from environmental and geoscience contexts. Extending the application of SPM to the hundreds of channels of hyperspectral remotely sensed satellite data would provide new insights to remote-sensing scientists.

In the interesting Chap. 16, Buccianti shows how compositional data analysis has a role in dealing with water chemistry. The author puts Ilya Prigogine's ideas and concepts (including dissipative structures, dynamical systems, open and closed systems that respectively draw energy from external sources and from within, self-organized criticality, universal power laws, time irreversibility) into a new perspective. It reminds the reader of the popular book *Order Out of Chaos: Man's New Dialogue with Nature* by Ilya Prigogine and Isabelle Stengers.

Chapter 17 by Grunsky, Drew, and Smith is the outcome of a major project concerned with soil geochemical analyses in the USA via principal component analysis and compositional framework approach. The material is presented with many maps, tabular data, and supplementary information.

Work carried out over three decades by Dowd and his group on the quantification of uncertainty via various approaches is reviewed in Chap. 18, with a focus on mineral and energy resources and environmental applications.

Olea in Chap. 19 explains uncertainty, geostatistics, and kriging methods on the basis of a coal seam example. Three ad hoc methods, namely distance analysis, kriging, and stochastic simulation, are employed for evaluation of their usage for predicting changes in uncertainty due to changes in spatially correlated samples. Also included is a demonstration of the efficacy of these methods on real data for the Anderson coal bed. It is inferred that the stochastic simulation-based approach outperforms distance and kriging-based methods.

Predicting molybdenum deposit growth as a function of cutoff grade, via a nonlinear model constructed by using data from several deposits, is addressed in Chap. 20 by Schuenemeyer, Drew, and Bliss. The predictions are based on a prior model derived by plotting cutoff grade as a function of deposit grade.

Chapter 21 by Pan discusses several aspects of mineral resources, mineral resource estimation, and associated features, and explains how and why these details are of fundamental importance.

Problems in mineral resource assessment and the three types of errors involved are discussed in Chap. 22 by Singer. Also presented in this chapter are possible ways to avoid these errors. The chapter is written in a way that can be understood by non-mathematicians and non-statisticians.

In Chap. 23 by Bonham-Carter and Grunsky, two exploratory multivariate methods, namely proximity regression and residual principal component analysis, are applied to analyze geochemical survey data. The first method is useful in making predictions of spatial proximity to geological features, whereas the second method is a recommended way for partitioning geochemical elements into clusters.

Chapter 24 by Doveton is concerned with an approach to compositional data analysis that is significantly different from the Aitchison/Pawlowsky-Glahn/Egozcue approach to CoData problem-solving.

The two parts of Chap. 25 by Soares and Azevedo, respectively, provide (i) the state of the art in recent geostatistical seismic inversion methods and their applications to evaluate reservoir properties, and (ii) a seismic inversion-based methodology to assess uncertainty and risks at an early stage of exploration.

In Chap. 26, Agterberg provides information-rich studies to understand the differences in the degree of heterogeneity in the spatial distribution of metal deposits between the regional and global levels. It is interesting to see that de Wijs' work formed the basis for this new version of the model, which provides a framework for explaining the difference between regional and worldwide distributions. The de Wijs model has also been used elsewhere in the iterated bisection process to compute multifractal spectra that provide a host of dimensions such as topological dimension, capacity dimension, and information dimension. Such dimensions are of immense use in understanding not only spatial but also temporal distribution patterns.

Chapter 27 by Caers provides views on why philosophical principles are required to be translated into workable practices.

Various approaches involving spatial statistics, geological variables, geometry and topology of geological objects to develop coherent Earth models are well documented as an excellent review in Chap. 28 by Caumon.

Origins of kriging, its success, and its new application domains across the last five decades, and the role of IAMG journals popularizing this technique by publishing in English are explained in Chap. 29 by Chilès and Desassis.

Recent advances in Multiple-Point Statistics (MPS), which is important and significant in handling complex and realistic phenomena of relevance to the Earth sciences, are thoroughly reviewed in Chap. 30 by Tahmasebi.

Mariethoz provides interesting views on the conceptual differences between the concurrent approaches of Multiple-Point Statistics and Covariance-Based Geostatistics in Chap. 31 with an illustrated example.

Srivastava provides information on the origin of Multiple-Point Statistics (MPS) algorithms along with many personal reminiscences in Chap. 32.

Chapter 33 by van den Boogaart and Tolosana-Delgado contains useful new proposals. This chapter provides the state of the art and mathematical building blocks for solutions in predictive geometallurgy, i.e., the understanding of geometallurgy. The chapter further explores possible links between geometallurgical problems and relevant techniques taken from mathematical geosciences. The next generation of mathematical geoscientists and experts in geoinformatics would surely benefit from the insights provided in this chapter.

Chapter 34 by Ma provides possible links between mathematical geosciences and Data Science. Many learning techniques such as artificial intelligence, active learning, machine learning and intelligence, and deep learning approaches together now play a much bigger role in pattern discovery from massive data sets, that is, in predictive geosciences. The journey from toy models developed by nonlinear physicists to predictive models has posed several newer challenges. Data Science would bring under one umbrella the powerful theories and algorithms available under different names in different disciplines.

Daya Sagar reviews potential applications of nonlinear mathematical morphological transformations to deal with a host of challenges encountered in geosciences and Geographical Information Science (GISci) with a large number of excellent case studies shown illustratively in Chap. 35.

Many recollections by IAMG members from the old days are provided in Chap. 36 by Cubitt and Henley, with contributions provided by T. Victor (Vic) Loudon, EHT (Tim) Whitten, John Gower, Daniel (Dan) Merriam, Thomas (Tom) Jones, and Hannes Thiergärtner. Also provided in this chapter is information on those pioneering scientists who were instrumental in forming and shaping the IAMG. The chapter is immensely useful for the younger generation of mathematical geoscientists in order to know and appreciate the hard work of peers and scientists of earlier generations.

How the applications of forward and inverse models, in particular in Earth science-related problems, evolved over a period of 70 years is lucidly explained in the simplest possible language by Whitten in Chap. 37. Besides this, the chapter considers how other approaches, in particular applications of scaling theories or fractal geometry and the theory of chaos (in other words, nonlinear approaches that have already shown significant success in modeling and characterizing various phenomena and processes of relevance to the Earth sciences), can be foreseen to develop in the next 50 years, giving scope for further research.

Václav Němec's professional and personal reminiscences are chronologically provided in Chap. 38 by Němec, along with details on the IAMG's formation and early development.

Chapter 39 by Henley provides a rounded view of the life and works, and a glimpse of the legacy, of Andrey Vistelius, first President of the IAMG.

Many theoretically sound techniques, algorithms, and software tools have shown promising results in certain application-specific domains but have limited utility in terms of generalization. Thiergärtner's interesting and genuine views, opinions, and recommendations in Chap. 40 are thought provoking.

Applications of kriging, inverse distance methods, and the variogram in multivariate data analysis, spatial estimation, and texture-based classification are shown with simple illustrations by Carr in Chap. 41.

Full in Chap. 42 provides a review of the development and applications of a linear unmixing method fairly extensively used by geologists during the past 50 years.

Chapter 43 on Pearce Element Ratios provides insight into the evolution of melts in volcanic systems along with many personal memories and (from the point of view of compositional data analysis) a somewhat antiquated method of approach. An excellent review with extensive Skaergaard applications is provided in this chapter by Nicholls.

Myers in Chap. 44 gives a helpful set of reflections by a mathematician who adopted geostatistics as a principal field of research and has made many important contributions to the field along with personal reminiscences on IAMG and the *Journal of Mathematical Geosciences*.

Agterberg in Chap. 45 provides a holistic view of the beginnings of the IAMG and of the academics, scientists, and engineers who were instrumental in shaping the IAMG and making it a most successful association promoting worldwide the advancement of mathematics, statistics, and informatics in the geosciences. This chapter enlightens and motivates the younger generation of mathematical geoscientists.

Bangalore, India B. S. Daya Sagar Beijing, China Qiuming Cheng Ottawa, Canada Frits Agterberg

### **Editors and Contributors**

### **About the Editors**

**B. S. Daya Sagar** is a Full Professor of the Systems Science and Informatics Unit (SSIU) at the Indian Statistical Institute. He received his M.Sc. and Ph.D. degrees in Geoengineering and Remote Sensing from the Faculty of Engineering, Andhra University, Visakhapatnam, India, in 1991 and 1994, respectively. He is also the first Head of the SSIU. Earlier, he worked in the College of Engineering, Andhra University, Centre for Remote Imaging, Sensing and Processing (CRISP), and the National University of Singapore in various positions during 1992–2001. He served as Associate Professor and Researcher in the Faculty of Engineering and Technology (FET), Multimedia University, Malaysia, during 2001–2007. Since 2017, he has been a Visiting Professor at the University of Trento, Trento, Italy. His research interests include mathematical morphology, GISci, digital image processing, fractals and multifractals, and their applications in the extraction, analysis, and modeling of geophysical patterns. He has published over 85 papers in journals and has authored and/or guest edited 11 books and/or special theme issues for journals. He recently authored a book entitled *Mathematical Morphology in Geomorphology and GISci*, CRC Press: Boca Raton, 2013, p. 546. He recently co-edited two special issues on "Filtering and Segmentation with Mathematical Morphology" for *IEEE Journal of Selected Topics in Signal Processing* (v. 6, no. 7, p. 737–886, 2012), and "Applied Earth Observation and Remote Sensing in India" for *IEEE Journal of Selected Topics in Applied Earth Observation and Remote Sensing* (v. 10, no. 12, p. 5149–5328, 2017). He is an elected Fellow of the Royal Geographical Society (1999) and of the Indian Geophysical Union (2011), and was a Member of the New York Academy of Sciences during 1995–1996. He received the Dr. Balakrishna Memorial Award from Andhra Pradesh Academy of Sciences in 1995, the Krishnan Gold Medal from Indian Geophysical Union in 2002, and the "Georges Matheron Award-2011 (with Lecturership)" of the International Association for Mathematical Geosciences. He is the Founding Chairman of Bangalore Section IEEE GRSS Chapter. He is on the Editorial Boards of Computers and Geosciences, and Frontiers: Environmental Informatics.

**Qiuming Cheng** received his Ph.D. degree in Earth Science under the supervision of Dr. Frits Agterberg at the University of Ottawa in 1994. He spent a year at the Geological Survey of Canada as a postdoctoral fellow under the supervision of Dr. Graeme Bonham-Carter and soon became a Faculty Member at York University, Toronto, Canada, in 1995 with cross-appointments in the Department of Earth and Space Science and Engineering and the Department of Geography. He was promoted to associate professor in 1997 and full professor in 2002. He was awarded a Changjiang Scholar Professorship in China by China's Ministry of Education, where he has set up and leads the State Key Lab of Geological Processes and Mineral Resources (GPMR) located on both campuses of China University of Geosciences in Beijing and Wuhan. Currently, he holds a Thousand Talent National Special Professorship of China, serving as the Founding Director of the GPMR laboratory. He has specialized in mathematical geoscience with a research focus on nonlinear mathematical modeling of Earth processes and geoinformatics techniques for prediction of mineral resources. He has authored and co-authored more than 300 research articles. He has received several prestigious awards including the Krumbein Medal, the highest award of the International Association for Mathematical Geosciences (IAMG). He was the elected President of the IAMG during 2012–16. He is the President of the International Union of Geological Sciences (IUGS) for the period between 2016 and 2020. He is an international leader in the application of nonlinear mathematics and geoinformatics to the analysis, modeling, and prediction of a wide range of geological processes and to the quantitative assessment of mineral resources.
His primary research interest involves the interdisciplinary study of nonlinear properties of the Earth's systems, as well as quantitative assessment and prediction of natural resources and environmental impacts. His research on fractal density and local singularity analysis theory and geomathematical models has made major impacts in several geoscientific disciplines, including those concerned with ocean ridge heat flow, magmatic flare-up during continent crustal growth and formation of supercontinents, earthquakes, floods, hydrothermal mineralization, and prediction of deeply buried mineral deposits.

**Frits Agterberg** is a Dutch-born Canadian Mathematical Geologist who served at the Geological Survey of Canada in Ottawa. He attended Utrecht University in the Netherlands from 1954 to 1961. With other founding members, he was instrumental in establishing the International Association for Mathematical Geosciences (IAMG) in 1968. He received the IAMG's William Christian Krumbein Medal in 1978, and he was IAMG Distinguished Lecturer in 2004. In 2017, he was conferred with the Honorary Membership of the IAMG. He has authored or co-authored over 250 scientific papers and 5 books. He has served the IAMG in many ways, including being its President from 2004 to 2008. After defending his doctoral thesis on structural geology of the Italian Alps at Utrecht University and a one-year fellowship at the University of Wisconsin in Madison, he became "petrological statistician" in his first job at the Geological Survey of Canada (GSC) in 1962. He was asked to create the GSC Geomathematics Section in 1971. He retired from the GSC in 1996 but still has an office at their Ottawa headquarters. In 1968, he became associated with the University of Ottawa where he taught a "statistics in geology" course for 25 years and has supervised six geomathematical Ph.D. students. From 1978 to 1989, he directed the Quantitative Stratigraphy Project of the International Geological Correlation Program. From 1981 to 2001, he was a Correspondent of the Royal Netherlands Academy of Arts and Sciences. During the past 20 years, primarily in collaboration with Qiuming Cheng, his colleagues, and students at the China University of Geosciences in Wuhan and Beijing and at York University, Toronto, he has worked on applications of multifractals to study the spatial distribution of metals in rocks and orebodies.

### **Contributors**

**Frits Agterberg** Geological Survey of Canada, Ottawa, ON, Canada

**M. Armstrong** Escola de Matemática Aplicada, Fundação Getulio Vargas, Rio de Janeiro, Brazil; MINES Paristech, PSL Research University, CERNA – Centre for Industrial Economy, i3, CNRS UMR 9217, Paris, France

**Leonardo Azevedo** CERENA, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal

**Adrian Baddeley** Department of Mathematics and Statistics, Curtin University, Perth, WA, Australia

**James D. Bliss** Southwest Statistical Consulting, LLC, Cortez, CO, USA

**G. F. Bonham-Carter** Merrickville, ON, Canada

**Antonella Buccianti** Department of Earth Sciences, University of Florence, Florence, Italy; CNR-IGG, Unit of Florence, Florence, Italy

**Jef Caers** Stanford University, Stanford, USA

**S. Camargo** Escola de Matemática Aplicada, Fundação Getulio Vargas, Rio de Janeiro, Brazil

**James R. Carr** Department of Geological Sciences and Engineering, University of Nevada, Reno, Reno, NV, USA

**Guillaume Caumon** GeoRessources-ENSG, Université de Lorraine – CNRS– CREGU, Vandoeuvre-lès-Nancy, France

**Qiuming Cheng** State Key Lab of Geological Processes and Mineral Resources, China University of Geosciences, Beijing, China

**Jean-Paul Chilès** Centre of Geosciences, Mines ParisTech, Fontainebleau, France

**Noel Cressie** Distinguished Professor, National Institute for Applied Statistics Research Australia (NIASRA), School of Mathematics and Applied Statistics, University of Wollongong, Wollongong, Australia

**John Cubitt** Holt, Wrexham, UK

**John C. Davis** Heinemann Oil GmbH, Baldwin City, KS, USA

**B. S. Daya Sagar** Systems Science and Informatics Unit, Indian Statistical Institute-Bangalore Centre, Bengaluru, India

**Nicolas Desassis** Centre of Geosciences, Mines ParisTech, Fontainebleau, France

**Clayton V. Deutsch** University of Alberta, Edmonton, Canada

**John H. Doveton** Kansas Geological Survey, Lawrence, KS, USA

**Peter Dowd, FREng, FTSE** The University of Adelaide, Adelaide, Australia

**L. J. Drew** United States Geological Survey, Reston, VA, USA

**Lawrence J. Drew** Southwest Statistical Consulting, LLC, Cortez, CO, USA

**Olivier Dubrule** Imperial College London, London, UK

**Juan José Egozcue** Department of Civil and Environmental Engineering, Universidad Politécnica de Cataluña, Barcelona, Spain

**William E. Full** GXStat, LLC, Wichita, KS, USA

**Pierre Goovaerts** BioMedware, Inc, Jerome, MI, USA

**E. C. Grunsky** Department of Earth and Environmental Sciences, University of Waterloo, Waterloo, ON, Canada; China University of Geosciences, Beijing, China

**Stephen Henley** Resources Computing International Limited, Matlock, Derbyshire, UK

**Michael E. Hohn** West Virginia Geological and Economic Survey, Morgantown, USA

**G. M. Kaufman** Management Emeritus, E62-437, Sloan School of Management MIT, Cambridge, MA, USA

**Xiaogang Ma** Department of Computer Science, University of Idaho, Moscow, ID, USA

**Gregoire Mariethoz** Institute of Earth Surface Dynamics (IDYST), University of Lausanne, Lausanne, Switzerland

**Sean A. McKenna** IBM Research, Dublin, Ireland

**A. F. Militino** Department of Statistics and O.R., Public University of Navarra (Spain), Pamplona, Spain; InaMat (Institute for Advanced Materials), Pamplona, Spain

**R. Mohan Srivastava** TriStar Gold Inc., Toronto, ON, Canada

**A. Mondaini** Department of Physics, UERJ, Rio de Janeiro, Brazil

**Donald E. Myers** Department of Mathematics, University of Arizona, Tucson, AZ, USA

**J. Nicholls** Department of Geoscience, University of Calgary, Calgary, AB, Canada

**Václav Němec** Praha 10 - Strašnice, Czech Republic

**Ricardo A. Olea** U.S. Geological Survey, Reston, VA, USA

**Guocheng Pan** China Hanking Holdings, Shenyang, Liaoning, People's Republic of China

**Vera Pawlowsky-Glahn** Department of Computer Science, Applied Mathematics and Statistics, University of Girona, Girona, Spain

**U. Pérez-Goya** Department of Statistics and O.R., Public University of Navarra (Spain), Pamplona, Spain

**Jean Serra** Ecole des Mines de Paris, Paris, France

**Helmut Schaeben** Geophysics and Geoinformatics, TU Bergakademie Freiberg, Freiberg, Germany

**John H. Schuenemeyer** Southwest Statistical Consulting, LLC, Cortez, CO, USA

**Donald A. Singer** Cupertino, CA, USA

**D. B. Smith** United States Geological Survey, Denver, CO, USA

**Amílcar Soares** CERENA, Instituto Superior Técnico, Universidade de Lisboa, Lisbon, Portugal

**Pejman Tahmasebi** Department of Petroleum Engineering, University of Wyoming, Laramie, WY, USA

**Hannes Thiergärtner** Department of Geosciences, Free University of Berlin, Berlin, Germany

**R. Tolosana-Delgado** Helmholtz Institute Freiberg for Resource Technology, Freiberg, Germany

**M. D. Ugarte** Department of Statistics and O.R., Public University of Navarra (Spain), Pamplona, Spain; InaMat (Institute for Advanced Materials), Pamplona, Spain

**K. G. van den Boogaart** Helmholtz Institute Freiberg for Resource Technology, Freiberg, Germany

**E. H. Timothy Whitten** Riverside, Widecombe-in-the-Moor, Devon, UK

### **Part I Theory**

### Chapter 1 Kriging, Splines, Conditional Simulation, Bayesian Inversion and Ensemble Kalman Filtering

### Olivier Dubrule

Abstract This chapter discusses, from a theoretical point of view, how the geostatistical approach relates to other commonly-used models for inversion or data assimilation in the petroleum industry. The formal relationship between point Kriging and splines or radial basis functions is first presented. The generalizations of Kriging to the estimation of average values or values affected by measurement errors are also addressed. Two algorithms are often used for conditional simulation: the "rough plus smooth" approach consists of adding a smooth correction to a non-conditional simulation, whilst sequential Gaussian simulation allows the point-by-point construction of the realizations. As with Kriging, conditional simulation can be applied to average values or to data affected by measurement errors. Geostatistical inversion generates high-resolution realizations of vertical impedance traces constrained by seismic amplitudes. If the relationship between impedance and amplitude data is linearized, geostatistical inversion is a particular case of Bayesian inversion. Because of the non-linearity of production data vis-à-vis the variables of the earth model, their assimilation is harder than that of seismic data. Ensemble Kalman filtering, if considered from a geostatistical viewpoint, consists of using a large number—or ensemble—of realizations to calculate empirical covariances between the dynamic data and the parameters of the geostatistical model. These covariances are then used in the equations for interpolating the mismatch between simulated and new production data using a coKriging-like formalism. Interestingly, most of these techniques can be expressed using the same generic equation by which an initial model not honouring some newly arrived data is made conditional to these data by adding a (co-)Kriged interpolation of the data mismatches to the initial model. 
In spite of their similar equations, Bayesian inversion, geostatistics and ensemble Kalman filtering have a different approach to the inference of the covariance models used by these equations.

Keywords Dual Kriging ⋅ Radial basis functions ⋅ Geostatistical inversion ⋅ Energy-based methods ⋅ Prediction error filter

© The Author(s) 2018

O. Dubrule (✉)

Imperial College London, London, UK e-mail: o.dubrule@imperial.ac.uk

B. S. Daya Sagar et al. (eds.), Handbook of Mathematical Geosciences, https://doi.org/10.1007/978-3-319-78999-6\_1

### 1.1 Introduction

Fifty years ago, when geostatistics was pioneered by Matheron (1971), its main applications were Kriging and the change of support for mining applications. At the time, geostatistics was presented as a new discipline, without much reference to its relationships with other mathematical interpolation and modeling techniques. This has now changed as the relationships between geostatistics and such techniques as splines, regularization, Bayesian inversion, or ensemble Kalman filtering have become clearer. This convergence is fascinating and has led to many significant developments allowing the integration of multi-disciplinary data into 3-D geostatistical earth models.

This chapter discusses approaches for generating 2-D or 3-D subsurface models constrained by geological (wells), seismic or dynamic data. In spite of the wealth of data available, the uncertainty on the 3-D earth model remains high in most cases. Approaches that are designed to generate one unique "deterministic" model often pick the smoothest one. This is not realistic in situations where the earth model is used for flow simulation, as the results are biased if the model heterogeneities are not representative of those of the actual reservoir. More generally, non-linear operations, such as the application of cut-offs, may give biased results when applied to deterministic smooth models such as those produced by Kriging.

The multi-realization approach is now routinely applied to subsurface parameters inversion. Looking at the mean provides much less information than looking at a movie of realizations. …By construction, each of the realizations captures the essential random fluctuations of the actual field from which the data were extracted (Tarantola 2005). This is a fundamental change. The traditional inversion approach could be formulated as "How to find an estimate of the spatial parameters which is as close as possible to the first guess values of these parameters and which provides, through forward modeling, an output which is as close as possible to the available data" (modified from Evensen 2007). These first guess values are usually a smooth (Kriging-like) spatial model of these parameters. Now the question has changed to "Find the probability density function (pdf) of 3-D models constrained by all the existing data, and provide techniques for sampling realizations from this pdf".

This chapter, written from a geostatistical perspective, discusses the convergence between the existing techniques.

Deterministic approaches such as Kriging, splines, regularization- or energy-based methods generate a single model of the subsurface, which usually minimizes or maximizes an optimisation criterion. These approaches are closely related and their formal relationships are discussed.

Geostatistical simulation is then revisited, and two key simulation algorithms are discussed: the first is sequential Gaussian simulation and the second is the "rough plus smooth" combination of an unconditional simulation and a smooth correction term. These two algorithms have helped bridge the gap between geostatistics and inversion.

Two successful approaches are then discussed for integrating seismic and dynamic data into the earth model. Rather than using an approach merely based on statistical correlations between data and model parameters, it is assumed that there exists a deterministic relationship (or forward model) between model parameters and data, possibly including a random error.

The first approach, geostatistical inversion, produces reservoir-scale models of acoustic or elastic parameters constrained by single- or multi-offset seismic amplitude data. The value of using sequential Gaussian simulation to calculate seismically-constrained realizations is discussed. In situations where the forward model is linear, geostatistical inversion can be formulated as a particular case of Bayesian seismic inversion.

The second approach, ensemble Kalman filtering, consists of sequentially updating an "ensemble" of geostatistical realizations using dynamic data as they are acquired in time. The key idea here is to statistically derive the covariance terms of the equation used in Bayesian inversion from an ensemble of realizations rather than from a theoretical covariance model. The formal relationship between ensemble Kalman filtering and co-Kriging is discussed.

Most of the above techniques can be shown to use the same kind of formalism, where the mismatch between newly arrived data and the current model is interpolated and used to update this model.

One of the conclusions of this chapter is that the equations of Bayes, geostatistics or ensemble Kalman filtering are closely related. However, this relationship is mostly formal as the three techniques differ in their approach to the covariances used in the equations. Geostatisticians first fit a model to the data, whilst Bayesians start from a model based on general "prior" information and only later in the process introduce the well data. Ensemble Kalman filtering, for its part, directly uses the experimental covariances calculated from the realizations of the ensemble.

The topic of joint inversion of seismic and dynamic data is not discussed here, in spite of the interesting on-going developments in 4-D seismic data inversion. This is because the objective of this chapter is to address formal relationships between the different formalisms rather than discuss specific applications.

### 1.2 Deterministic Aspects of Geostatistics

### 1.2.1 Simple Stationary Kriging

The basic model used by geostatistics is that of stationary random functions of order 2: a spatial property $z(\mathbf{x})$ at location $\mathbf{x}$ is represented by a random function $Z(\mathbf{x})$, which is assumed to follow a trend $m(\mathbf{x})$ and a stationary covariance $C(\mathbf{h})$

$$m(\mathbf{x}) = E(Z(\mathbf{x})) \tag{1.1a}$$

$$C(\mathbf{h}) = E(Z(\mathbf{x})Z(\mathbf{x} + \mathbf{h})) - E(Z(\mathbf{x}))E(Z(\mathbf{x} + \mathbf{h})) \tag{1.1b}$$

At each unsampled location $\mathbf{x}$, the value of $Z(\mathbf{x})$ is estimated by a linear combination $Z\_k(\mathbf{x})$ of the values $Z\_i = Z(\mathbf{x}\_i)$ at the $n$ data points $(\mathbf{x}\_i)\_{i=1,\ldots,n}$. Kriging is the best linear unbiased estimator, in the sense that it is unbiased and that it minimizes the estimation variance. If the trend $m(\mathbf{x})$ is known at each location $\mathbf{x}$, the simple Kriging (Chilès and Delfiner 2012, p. 151) system of equations is obtained

$$Z\_k(\mathbf{x}) - m(\mathbf{x}) = \sum\_{i=1}^n \lambda\_i (Z\_i - m(\mathbf{x}\_i)) \tag{1.2a}$$

$$\text{with } \sum\_{i=1}^{n} \lambda\_i C \left(\mathbf{x\_i} - \mathbf{x\_j}\right) = C \left(\mathbf{x} - \mathbf{x\_j}\right) \text{ for } j \in \left(1, \ldots, n\right) \tag{1.2b}$$
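As an illustration of Eqs. 1.2a and 1.2b, the following minimal sketch solves the simple Kriging system in one dimension with NumPy. The exponential covariance, its range, the constant known mean and the data values are assumptions made for the example and do not come from the chapter. The Kriging variance line uses the standard simple Kriging expression $C(0) - \sum\_i \lambda\_i C(\mathbf{x} - \mathbf{x}\_i)$.

```python
import numpy as np

def exp_cov(h, sill=1.0, corr_range=10.0):
    """Exponential covariance C(h); an illustrative model, not taken from the chapter."""
    return sill * np.exp(-np.abs(h) / corr_range)

def simple_kriging(x_data, z_data, x_target, mean, cov=exp_cov):
    """Simple Kriging at one 1-D target location with a known constant mean."""
    x_data = np.asarray(x_data, dtype=float)
    z_data = np.asarray(z_data, dtype=float)
    # Left-hand side of Eq. 1.2b: C(x_i - x_j) for all pairs of data points.
    C = cov(x_data[:, None] - x_data[None, :])
    # Right-hand side of Eq. 1.2b: C(x - x_j).
    c0 = cov(x_target - x_data)
    weights = np.linalg.solve(C, c0)
    # Eq. 1.2a: the weights are applied to the residuals from the known mean.
    estimate = mean + weights @ (z_data - mean)
    # Standard simple Kriging variance.
    variance = cov(0.0) - weights @ c0
    return estimate, variance, weights

x_data = [0.0, 4.0, 9.0]
z_data = [1.2, 0.7, 1.9]
est, var, w = simple_kriging(x_data, z_data, x_target=4.0, mean=1.0)
print(est, var)   # at a data location the datum is honoured and the variance vanishes
```

Far from all data the right-hand side of Eq. 1.2b tends to zero, so the weights vanish and the estimate reverts to the known mean.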

### 1.2.2 Kriging with Intrinsic Random Functions of Order k

Matheron (1973) generalized the above model to that of Intrinsic Random Functions of Order k (IRF-k), where the definition of the variogram as a generalized covariance of order zero and of generalized covariances of order k leads to a model based on the stationarity of generalized increments of order k.

With k-IRFs, the model only considers linear combinations of $Z(\mathbf{x})$ that filter polynomials of order k (such polynomials being likely to represent a trend). Simple Kriging is not applicable any more. For instance, if k = 1 in two dimensions, and if $K(\mathbf{h})$ designates the generalized covariance of order k (GC-k), the kriging system becomes

$$Z\_k(\mathbf{x}) = \sum\_{i=1}^n \lambda\_i Z\_i \tag{1.3a}$$

$$\begin{aligned} \text{with } \sum\_{i=1}^{n} \lambda\_i K \left( \mathbf{x\_i} - \mathbf{x\_j} \right) + \mu\_0 + \mu\_1 \mathbf{x\_{j1}} + \mu\_2 \mathbf{x\_{j2}} &= K \left( \mathbf{x} - \mathbf{x\_j} \right) \text{for } j \in (1, \ldots, n) \\ \text{and } \sum\_{i=1}^{n} \lambda\_i = 1 &\quad \sum\_{i=1}^{n} \lambda\_i \mathbf{x\_{i1}} = \mathbf{x\_1} &\quad \sum\_{i=1}^{n} \lambda\_i \mathbf{x\_{i2}} = \mathbf{x\_2} \end{aligned} \tag{1.3b}$$

where the coordinates of each point $\mathbf{x}$ of the plane are written as $\mathbf{x} = (x\_1, x\_2)$.
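The system of Eq. 1.3b can be assembled as one bordered matrix. The sketch below does this for k = 1 in two dimensions, assuming the linear generalized covariance $K(\mathbf{h}) = -|\mathbf{h}|$ (an illustrative admissible choice, not prescribed by the chapter) and hypothetical data lying exactly on a plane. Because the constraints of Eq. 1.3b filter polynomials of degree 1, such data should be reproduced exactly.

```python
import numpy as np

def gen_cov(h):
    """Linear generalized covariance K(h) = -|h| (illustrative choice)."""
    return -np.asarray(h, dtype=float)

def irf1_kriging(xy_data, z_data, xy_target, K=gen_cov):
    """Kriging estimate for k = 1 in 2-D, following Eqs. 1.3a and 1.3b."""
    xy_data = np.asarray(xy_data, dtype=float)
    xy_target = np.asarray(xy_target, dtype=float)
    n = len(xy_data)
    # Distances |x_i - x_j| between all pairs of data points.
    dist = np.linalg.norm(xy_data[:, None, :] - xy_data[None, :, :], axis=-1)
    # Drift monomials 1, x1, x2 evaluated at the data points.
    P = np.column_stack([np.ones(n), xy_data])
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K(dist)      # K(x_i - x_j)
    A[:n, n:] = P            # Lagrange terms mu_0 + mu_1 x_j1 + mu_2 x_j2
    A[n:, :n] = P.T          # the three constraints on the weights
    rhs = np.concatenate(
        [K(np.linalg.norm(xy_data - xy_target, axis=1)), [1.0], xy_target]
    )
    sol = np.linalg.solve(A, rhs)
    weights = sol[:n]
    return weights @ np.asarray(z_data, dtype=float), weights

xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.3, 0.6]])
z = 2.0 + 3.0 * xy[:, 0] - xy[:, 1]      # hypothetical data on a plane
estimate, weights = irf1_kriging(xy, z, [0.4, 0.2])
print(estimate, weights.sum())
```

The data points must be distinct and not all collinear for the bordered matrix to be invertible.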

### 1.2.3 Kriging Extensions

The goal here is not to discuss the details of Kriging, as there are plenty of excellent textbooks for this (Chilès and Delfiner 2012, p. 150). However, two features of Kriging deserve to be discussed, as they facilitate the understanding of the relationship between Kriging, splines and Bayesian approaches.

### 1.2.3.1 Generalization of Kriging to the Interpolation of Average Values

Kriging is a linear interpolator. The data used by Kriging do not have to be point values; they can be any linear function of the parameters of interest. Hansen et al. (2006) call these "volume support data". In particular, Kriging can be used to estimate the average value $Z(\mathbf{v}\_{\mathbf{x}})$ of a parameter over a volume located at $\mathbf{x}$ by a linear combination of volume support data $Z(\mathbf{v}\_{\mathbf{x}\_i})$ (Chilès and Delfiner 2012, p. 198)

$$Z\_k(\mathbf{v}\_{\mathbf{x}}) = \sum\_{i=1}^n \lambda\_i Z(\mathbf{v}\_{\mathbf{x}\_i}) \tag{1.4}$$

This property of Kriging, extensively used in mining applications, is of significant interest in the context of linear inversion of volume support data (Hansen et al. 2006). The Kriging equations associated with Eq. 1.4 are not given here: they are somewhat cumbersome to write, although conceptually simple thanks to the linearity of Kriging.

### 1.2.3.2 Error CoKriging

Error coKriging (Dubrule 2003) is a generalization of Kriging to the situation where measurements $Y\_i$ of the parameter $Z\_i$ at data points $\mathbf{x}\_i$ are affected by an unbiased random error

$$Y\_i = Z\_i + \varepsilon\_i \text{ with } E(\varepsilon\_i) = 0 \text{ and } Var(\varepsilon\_i) = C\_{\varepsilon\_i} \tag{1.5}$$

In this situation, error coKriging allows the estimation of $Z(\mathbf{x})$ at any unsampled location $\mathbf{x}$ from a linear combination of values $Y\_i$ (the random measurement error attached to each data point can be zero or not) (Dubrule 2003; Hansen et al. 2006; Chilès and Delfiner 2012, p. 216)

$$Z\_k(\mathbf{x}) = \sum\_{i=1}^n \lambda\_i Y\_i \tag{1.6}$$

### 1.2.3.3 Dual Kriging

If a global neighborhood is used, that is, if all the available data are used to estimate $Z(\mathbf{x})$ at every single location $\mathbf{x}$, the Kriging equations (Eq. 1.3) can be inverted to obtain the dual Kriging system (for interpolation in the case of Kriging and smoothing in the case of error coKriging). For example, in two dimensions for a k-IRF of order 1

$$z\_k(\mathbf{x}) = z\_k(\mathbf{x}\_1, \mathbf{x}\_2) = a\_0 + a\_1 \mathbf{x}\_1 + a\_2 \mathbf{x}\_2 + \sum\_{i=1}^n b\_i K(\mathbf{x} - \mathbf{x}\_i) \tag{1.7}$$

where the conditions on the coefficients $(a\_0, a\_1, a\_2, b\_1, \ldots, b\_n)$ are different for Kriging and error coKriging (Dubrule 1983)

$$\text{Kriging:} \quad \sum\_{i=1}^{n} b\_i = \sum\_{i=1}^{n} b\_i \mathbf{x}\_{i1} = \sum\_{i=1}^{n} b\_i \mathbf{x}\_{i2} = \mathbf{0} \quad \text{and} \quad \mathbf{z}\_k(\mathbf{x}\_{i1}, \mathbf{x}\_{i2}) = \mathbf{z}\_i \tag{1.8}$$

$$\text{Error coKriging:} \quad \sum\_{i=1}^{n} b\_i = \sum\_{i=1}^{n} b\_i \mathbf{x}\_{i1} = \sum\_{i=1}^{n} b\_i \mathbf{x}\_{i2} = 0 \quad \text{and} \quad z\_k(\mathbf{x}\_{i1}, \mathbf{x}\_{i2}) + b\_i C\_{\varepsilon\_i} = y\_i \tag{1.9}$$
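The dual form lends itself to a compact implementation: the coefficients of Eq. 1.7 are obtained once from a single linear system, after which the estimate at any location is a direct evaluation. The sketch below is one possible arrangement under stated assumptions: the generalized covariance $K(\mathbf{h}) = -|\mathbf{h}|$, the data values and the helper names `fit_dual_kriging` and `predict_dual` are all hypothetical. A zero error variance gives the interpolation conditions of Eq. 1.8, and a positive one gives the smoothing conditions of Eq. 1.9, with the error variance $C\_{\varepsilon\_i}$ of Eq. 1.5 on the diagonal.

```python
import numpy as np

def gen_cov(h):
    """Linear generalized covariance K(h) = -|h| (illustrative choice)."""
    return -np.asarray(h, dtype=float)

def fit_dual_kriging(xy_data, y_data, noise_var=0.0, K=gen_cov):
    """Return (b, a): the coefficients b_1..b_n and a_0, a_1, a_2 of Eq. 1.7."""
    xy_data = np.asarray(xy_data, dtype=float)
    y_data = np.asarray(y_data, dtype=float)
    n = len(xy_data)
    dist = np.linalg.norm(xy_data[:, None, :] - xy_data[None, :, :], axis=-1)
    P = np.column_stack([np.ones(n), xy_data])
    A = np.zeros((n + 3, n + 3))
    # noise_var = 0 corresponds to Eq. 1.8; a positive value adds the
    # b_i * C_eps_i term of Eq. 1.9 on the diagonal.
    A[:n, :n] = K(dist) + np.diag(np.broadcast_to(noise_var, (n,)))
    A[:n, n:] = P
    A[n:, :n] = P.T          # sum b_i = sum b_i x_i1 = sum b_i x_i2 = 0
    sol = np.linalg.solve(A, np.concatenate([y_data, np.zeros(3)]))
    return sol[:n], sol[n:]

def predict_dual(xy_new, xy_data, b, a, K=gen_cov):
    """Evaluate Eq. 1.7 at one or several new locations."""
    xy_new = np.atleast_2d(np.asarray(xy_new, dtype=float))
    xy_data = np.asarray(xy_data, dtype=float)
    dist = np.linalg.norm(xy_new[:, None, :] - xy_data[None, :, :], axis=-1)
    return a[0] + xy_new @ a[1:] + K(dist) @ b

xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.3, 0.6]])
y = np.array([1.0, 2.5, 0.3, 1.7, 2.2])

b, a = fit_dual_kriging(xy, y)                        # interpolation
b_s, a_s = fit_dual_kriging(xy, y, noise_var=0.5)     # smoothing
print(predict_dual(xy, xy, b, a))                     # reproduces y
print(predict_dual(xy, xy, b_s, a_s) + 0.5 * b_s)     # also equals y
```

The main practical appeal of this form is that the system is solved only once for the whole map, at the cost of using all data at every location.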

### 1.2.4 Kriging and Splines

#### 1.2.4.1 Interpolating Splines

Splines are a popular method for deterministic interpolation and approximation (Micula and Micula 1999). In 2-D, interpolating splines calculate a function honouring the data and minimizing an energy functional. Harmonic splines minimize the stretching energy of a membrane while biharmonic splines minimize the bending energy of an elastic plate. The biharmonic spline function can be written using an expression similar to Eq. 1.7 (Duchon 1975), but with a specific model for the generalized covariance function

$$K(\mathbf{x} - \mathbf{x}\_{i}) = \left( \left( \mathbf{x}\_{1} - \mathbf{x}\_{i1} \right)^{2} + \left( \mathbf{x}\_{2} - \mathbf{x}\_{i2} \right)^{2} \right) \log \left( \sqrt{\left( \mathbf{x}\_{1} - \mathbf{x}\_{i1} \right)^{2} + \left( \mathbf{x}\_{2} - \mathbf{x}\_{i2} \right)^{2}} \right) \tag{1.10}$$

Splines and Kriging are particular cases of a more general class of interpolators, called radial basis functions (Billings et al. 2002a, b). With splines, the polynomial in Eq. 1.7 belongs to the kernel of the operator $T$ whose norm is minimized by the spline function ($T$ is the gradient for harmonic splines and the Laplacian for biharmonic splines), whilst the function $K(\mathbf{h})$ is the Green's function associated with the operator $T'T$, where $T'$ is the transposed operator of $T$ (Matheron 1981a)

$$T'TK(\mathbf{h}) = \delta \tag{1.11}$$

where $\delta$ is the Dirac delta function. Choosing the energy functional minimized by splines is equivalent to fixing the degree of the trend function and the generalized covariance model for Kriging. For harmonic splines, these are respectively a constant and the De Wijs variogram in $\log h$ (Chilès and Delfiner 2012, p. 94).

The consequence of Eq. 1.11 on the spectral density of the generalized covariance $K(\mathbf{h})$ is straightforward. For example, the spectral densities associated with the harmonic and biharmonic splines are power laws, representing fractal models. Szeliski and Terzopoulos (1989) and Micula and Micula (1999) discuss this relationship between splines and fractals.

#### 1.2.4.2 Smoothing Splines

Smoothing splines are used in situations where measurements at data points are affected by a random error (Eq. 1.5). In two dimensions, they compute a function $f(x\_1, x\_2)$ minimizing the sum of a spline energy functional plus a weighted distance to the $n$ data

$$\left\| Tf \right\|^2 + \theta \sum\_{i=1}^{n} \frac{\left(f(\mathbf{x}\_{i1}, \mathbf{x}\_{i2}) - y\_i\right)^2}{C\_{\varepsilon\_i}} \tag{1.12}$$

The smoothing biharmonic spline function has the same expression as that of Kriging and error coKriging (Eq. 1.7) but with the following relationships

$$\sum\_{i=1}^{n} b\_i = \sum\_{i=1}^{n} b\_i \mathbf{x}\_{i1} = \sum\_{i=1}^{n} b\_i \mathbf{x}\_{i2} = 0 \quad \text{and} \quad f(\mathbf{x}\_{i1}, \mathbf{x}\_{i2}) + b\_i \frac{C\_{\varepsilon\_i}}{\theta} = y\_i \tag{1.13}$$

Smoothing biharmonic splines are identical to error coKriging as long as the generalized covariance used by error coKriging is the function $\theta K(\mathbf{x} - \mathbf{x}\_i)$, where $K(\mathbf{x} - \mathbf{x}\_i)$ is given by Eq. 1.10 (Matheron 1981a; Dubrule 2003). This is a general relationship between smoothing splines and coKriging, which are formally equivalent if the generalized covariance $K(\mathbf{h})$ is that satisfying Eq. 1.11, with the coefficient of $K(\mathbf{h})$ equal to the smoothing parameter $\theta$ of Eq. 1.12.
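This equivalence can be checked numerically. The sketch below, which assumes synthetic data, hypothetical error variances and an arbitrary value of $\theta$, solves the dual system twice with the biharmonic kernel of Eq. 1.10: once with the smoothing spline conditions of Eq. 1.13, and once as error coKriging with the generalized covariance scaled by $\theta$. Substituting $b\_i = \theta b'\_i$ in the second system gives back the first, so the two fitted surfaces should coincide.

```python
import numpy as np

def thin_plate(r):
    """Kernel of Eq. 1.10, r^2 log(r), with the value 0 at r = 0."""
    r = np.asarray(r, dtype=float)
    return r**2 * np.log(np.where(r > 0.0, r, 1.0))

def fit_and_predict(xy_data, y, xy_new, kernel_scale, noise_var):
    """Dual system with generalized covariance kernel_scale * K and a diagonal noise term."""
    n = len(xy_data)
    P = np.column_stack([np.ones(n), xy_data])
    dist = np.linalg.norm(xy_data[:, None, :] - xy_data[None, :, :], axis=-1)
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = kernel_scale * thin_plate(dist) + np.diag(noise_var)
    A[:n, n:] = P
    A[n:, :n] = P.T
    sol = np.linalg.solve(A, np.concatenate([y, np.zeros(3)]))
    b, a = sol[:n], sol[n:]
    dist_new = np.linalg.norm(xy_new[:, None, :] - xy_data[None, :, :], axis=-1)
    return a[0] + xy_new @ a[1:] + kernel_scale * thin_plate(dist_new) @ b

rng = np.random.default_rng(0)
xy_data = rng.uniform(0.0, 1.0, size=(8, 2))
y = np.sin(3.0 * xy_data[:, 0]) + xy_data[:, 1] + 0.1 * rng.standard_normal(8)
c_eps = np.full(8, 0.01)     # error variances C_eps_i (hypothetical values)
theta = 5.0                  # smoothing parameter (hypothetical value)
xy_new = rng.uniform(0.0, 1.0, size=(5, 2))

# Smoothing spline, Eq. 1.13: kernel K and diagonal term C_eps_i / theta.
f_spline = fit_and_predict(xy_data, y, xy_new, 1.0, c_eps / theta)
# Error coKriging, Eq. 1.9: generalized covariance theta * K and diagonal term C_eps_i.
f_cokriging = fit_and_predict(xy_data, y, xy_new, theta, c_eps)
print(np.max(np.abs(f_spline - f_cokriging)))
```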

#### 1.2.4.3 Kriging and Regularization: The Discrete Case

The discrete case is the situation where interpolation is performed at the nodes of a regular grid and each data point is located at one of the nodes of this grid. If $p$ is the total number of grid nodes, the number $n$ of data points is such that $n < p$.

In the discrete case, Matheron (1981b) also demonstrated the equivalence between splines and Kriging, and between smoothing splines and error coKriging. Both the Kriged and spline values $z\_u$ minimize

$$\sum\_{u,v=1}^{p} z\_{u} B\_{uv} z\_{v} + \sum\_{i=1}^{n} \frac{\left(z\_i - y\_i\right)^2}{C\_{\varepsilon\_i}} \tag{1.14}$$

where the $u$ and $v$ indices designate all the $p$ grid points where the interpolation takes place, whilst the $i$ indices designate the $n$ data points. The minimization of Eq. 1.14 is performed with respect to the unknown values $z\_u$ at all grid nodes (including those unknown values $z\_i$ where a data point with measured value $y\_i$ is present). The first term of Eq. 1.14 can be interpreted as a quadratic energy function traditionally used in inverse problems. In the regularization context, the choice of this quadratic form is driven by smoothing considerations, often using Briggs' finite difference Laplacian (or spline) "roughening" operator (Briggs 1974; Bolondi et al. 1976). Seen from the geostatistical perspective, $B\_{uv}$ is the inverse of the covariance matrix in the stationary case and a pseudo-inverse of the generalized covariance matrix in the k-IRF case (Matheron 1981b). Equation 1.14 confirms the clear relationship between the inverse of the (generalized) covariance and the spline differential operator.

Kriging can thus be formalized in the frame of energy-based estimation techniques such as splines. This comes from the relationship between the inverse of the covariance function and the roughening filter implicit in the quadratic regularization term. It will be shown below that the regularization term can also be regarded, in the Bayesian inversion context, as an expression of the prior knowledge about the variable under study.
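For the stationary case, the link between the energy of Eq. 1.14 and Kriging can be verified on a small grid. The following sketch makes several assumptions that are not in the chapter: a 1-D grid, an exponential covariance, a zero mean (so that the first term of Eq. 1.14 applies to the values themselves) and a hypothetical selection matrix `S` that picks the data nodes. Setting the gradient of Eq. 1.14 to zero gives one linear system, and the zero-mean error coKriging estimate gives another. The two solutions should agree by the matrix inversion lemma.

```python
import numpy as np

p = 20
nodes = np.arange(p, dtype=float)
# Stationary covariance matrix between grid nodes (illustrative exponential model).
C = np.exp(-np.abs(nodes[:, None] - nodes[None, :]) / 5.0)
B = np.linalg.inv(C)                     # B_uv of Eq. 1.14 in the stationary case

data_idx = np.array([2, 7, 11, 16])      # grid nodes carrying a measurement
y = np.array([0.8, -0.3, 0.5, 1.1])      # hypothetical measured values
c_eps = np.full(len(y), 0.05)            # error variances C_eps_i

# Selection matrix: (S z)_i is the grid value at the i-th data node.
S = np.zeros((len(y), p))
S[np.arange(len(y)), data_idx] = 1.0

# Energy minimizer: the gradient 2 B z + 2 S' W (S z - y) is set to zero.
W = np.diag(1.0 / c_eps)
z_energy = np.linalg.solve(B + S.T @ W @ S, S.T @ W @ y)

# Zero-mean error coKriging of all grid nodes from the same data.
z_kriging = C @ S.T @ np.linalg.solve(S @ C @ S.T + np.diag(c_eps), y)

print(np.max(np.abs(z_energy - z_kriging)))
```

The energy form works with the sparse-like inverse covariance over all $p$ nodes, whereas the Kriging form only inverts an $n \times n$ matrix, which is one reason the two formulations suit different problem sizes.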

### 1.2.5 Kriging and Bayesian Inversion

### 1.2.5.1 Bayesian Linear Inversion

Here it may be useful to recall the general expression of the posterior mean and covariance in the case of Bayesian linear inversion of a multigaussian function. A very good reference for this is Tarantola (2005).

In the discrete case, consider a stationary multigaussian random vector $\mathbf{z}$ of dimension $p$ containing the grid values $z\_u$ over a two- or three-dimensional regular grid of size $p$. Assume also that a vector $\mathbf{y}$ contains the $n$ data $y\_i$. It is assumed again that the data are affected by an error vector $\boldsymbol{\varepsilon}$ of dimension $n$, and also that these data are a linear function of the $p$ values of $\mathbf{z}$ over the grid

$$
\mathbf{y} = F\mathbf{z} + \boldsymbol{\varepsilon} \tag{1.15}
$$

where the vector ε has mean zero and covariance matrix $C_{\varepsilon}$, and F is a matrix of dimension n × p. In the multigaussian case, thanks to the Bayes formula relating the posterior pdf $f_{\rm post}(z)$ to the prior pdf $f_{\rm prio}(z)$ and the likelihood function $g(\mathbf{y} \mid z)$, the prior mean vector m (dimension p) and covariance matrix C (dimension p × p) of z are updated using the information brought by the data vector y

$$f_{\rm post}(z) \propto f_{\rm prio}(z)\, g(\mathbf{y} \mid z) \propto \exp\left[-\tfrac{1}{2}(z-m)' C^{-1}(z-m)\right] \times \exp\left[-\tfrac{1}{2}(\mathbf{y}-Fz)' C_{\varepsilon}^{-1}(\mathbf{y}-Fz)\right] \tag{1.16}$$

$f_{\rm post}(z)$ is a multigaussian function with the mean vector

$$m_{\rm post} = m + CF'\left(FCF' + C_{\varepsilon}\right)^{-1}(\mathbf{y} - Fm) \tag{1.17}$$

and the covariance matrix

$$C_{\rm post} = C - CF'\left(FCF' + C_{\varepsilon}\right)^{-1} FC \tag{1.18}$$
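The update of Eqs. 1.17 and 1.18 can be illustrated with a minimal NumPy sketch. The grid size, the exponential covariance, the data locations and the error variance below are hypothetical choices made for illustration only; F simply picks grid nodes, which corresponds to point data.

```python
import numpy as np

def bayes_linear_update(m, C, F, C_eps, y):
    """Posterior mean and covariance of z given y = F z + eps (Eqs. 1.17 and 1.18)."""
    S = F @ C @ F.T + C_eps                 # covariance of the data, n x n
    Lam = np.linalg.solve(S, F @ C).T       # C F' S^{-1}, p x n (C and S are symmetric)
    m_post = m + Lam @ (y - F @ m)
    C_post = C - Lam @ F @ C
    return m_post, C_post

# Hypothetical toy setup: 1-D grid of p nodes, exponential covariance, three point data.
p = 50
x = np.arange(p, dtype=float)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 10.0)
idx = np.array([5, 20, 40])                 # grid nodes carrying a datum
F = np.zeros((idx.size, p))
F[np.arange(idx.size), idx] = 1.0
C_eps = 0.01 * np.eye(idx.size)             # small independent measurement errors
y = np.array([1.0, -0.5, 0.8])
m = np.zeros(p)

m_post, C_post = bayes_linear_update(m, C, F, C_eps, y)
```

With small error variances the posterior mean nearly honors the data, and the posterior variance never exceeds the prior variance.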

#### 1.2.5.2 Kriging and Bayesian Inversion

Equation 1.17 can also be written

$$m_{\rm post} = m + \Lambda(\mathbf{y} - Fm) = (I - \Lambda F)m + \Lambda \mathbf{y} \tag{1.19}$$

with

$$
\Lambda = CF' \left( FCF' + C_{\varepsilon} \right)^{-1} \tag{1.20}
$$

It can be checked that Λ is also the p × n matrix whose row u contains the n simple Kriging (or error coKriging) weights associated with the Kriging of the value $z_u$ at node u. Comparing the first part of Eq. 1.19 with Eq. 1.2 shows that, in the multigaussian case, $m_{\rm post}$ is equal to simple Kriging and that the matrix $C_{\rm post}$ contains the variances and covariances of simple Kriging at each node u of the regular grid.

### 1.2.6 Energy-Based Versus Probabilistic Estimates

The minimization of Eq. 1.14 leads to either Kriging or splines if the inverse of the covariance (Kriging) or the differential operator (splines) is properly chosen. Minimizing the expression in Eq. 1.14 is equivalent to maximizing

$$\exp\left(-\left(\sum_{u,v=1}^{p} z_u B_{uv} z_v + \sum_{i=1}^{n} \frac{\left(z_i - y_i\right)^{2}}{C_{\varepsilon_i}}\right)\right) \tag{1.21}$$

This is also the expression (up to a multiplicative constant) of the conditional multivariate distribution in the multigaussian case, as given by Bayes' theorem (Eq. 1.16), in the case where m = 0, where the matrix $C_{\varepsilon}$ is diagonal and where the data are point values. The first term represents the prior pdf and the second the likelihood function. Kriging, which is equal to the mean of the posterior pdf, also maximizes this pdf in the multigaussian case.

Expression (1.21) relates the world of energy functionals (such as splines) with that of probability functions (such as Kriging). More generally, regularization and maximum a posteriori Bayesian estimates are identical if the prior covariance used in Bayesian inversion is properly chosen. The equivalence between an energy function and a probability distribution is also used in statistical mechanics, as the probability of a particular configuration is inversely related to its energy. Suppose that the vector z minimizes an energy functional $E(z)$. Using the results of Geman and Geman (1984), Szeliski and Terzopoulos (1989) associate a probability to this energy through the Boltzmann (or Gibbs) distribution $p(z)$ defined as

$$p(z) = \frac{1}{Z} \exp\left(-\frac{E(z)}{T}\right) \tag{1.22}$$

where Z and T are positive constants. If Bayes' theorem is applied to the above prior pdf $p(z)$ and the posterior pdf is maximized, the formalism of splines is obtained.
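The equivalence between the energy-based and the probabilistic estimates can be checked numerically. The sketch below, under assumed toy values (1-D grid, exponential covariance, four point data with independent errors), minimizes a quadratic energy with B taken as the inverse of the covariance, and compares the result with the posterior mean of Eq. 1.17 for m = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 30, 4
x = np.arange(p, dtype=float)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 5.0)   # assumed prior covariance
idx = np.array([3, 11, 18, 26])                       # assumed data locations
F = np.zeros((n, p))
F[np.arange(n), idx] = 1.0
C_eps = np.diag([0.05, 0.1, 0.05, 0.2])               # diagonal error covariance
y = rng.standard_normal(n)

# (a) Energy route: minimize E(z) = z'Bz + (Fz - y)' C_eps^{-1} (Fz - y), with B = C^{-1}.
#     Setting the gradient to zero gives (B + F' C_eps^{-1} F) z = F' C_eps^{-1} y.
B = np.linalg.inv(C)
W = np.linalg.inv(C_eps)
z_energy = np.linalg.solve(B + F.T @ W @ F, F.T @ W @ y)

# (b) Probabilistic route: posterior mean of Eq. 1.17 with m = 0.
z_bayes = C @ F.T @ np.linalg.solve(F @ C @ F.T + C_eps, y)

print(np.max(np.abs(z_energy - z_bayes)))             # difference at rounding level
```

The two routes agree because of the matrix inversion (Woodbury) identity; which one is cheaper depends on whether p or n is the smaller dimension.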

### 1.2.7 Conclusion on Kriging

Three different ways of calculating a Kriging interpolator have been discussed


Kriging, although derived using a probabilistic formalism, is still a deterministic technique, in the sense that one unique or "best" model is produced. In most cases, Kriging provides a representation that is very smooth. As a result, the application of non-linear operators to Kriged models will provide biased results (Dubrule 2003). This is one of the reasons for the success of conditional simulation.

### 1.3 Stochastic Aspects of Geostatistics: Conditional Simulation

With conditional simulation, the approach is stochastic. A large number of realizations are generated, which match the data (if the simulation is conditional) and share the first (mean) and second order (stationary covariance or generalized covariance) moments of the modeled random function. The main benefit of conditional simulation is that it produces realizations that behave away from the well data the same way as the well data themselves (Dubrule 2003). This is not true with Kriging, which produces a model that is smoother away from the wells than it is at the wells.

Conditional simulation can also be regarded as a technique for generating realizations of the conditional multigaussian pdf fully characterized by Eqs. 1.17 and 1.18. In other words, the realizations "vibrate" around their Kriging mean with a variance at each location equal to the Kriging variance.

A number of conditional simulation algorithms have been developed (Chilès and Delfiner 2012, p. 478). Among them, two are routinely used in the petroleum industry and are particularly interesting in relation with the inversion of seismic and production data.

### 1.3.1 Method 1: "Smooth Plus Rough" or "Rough Plus Smooth" Algorithm

$Z(\mathbf{x})$ can be simply written as the sum of Kriging plus the Kriging error

$$Z(\mathbf{x}) = Z_k(\mathbf{x}) + (Z(\mathbf{x}) - Z_k(\mathbf{x})) \tag{1.23}$$

The "smooth plus rough" (Oliver 1996) simulation method writes a conditional simulation $Z_{cs}(\mathbf{x})$ as the sum of Kriging plus a simulation of the Kriging error. A non-conditional simulation $Z_{ncs}(\mathbf{x})$ of $Z(\mathbf{x})$ is generated first, which honors the mean and the covariance of $Z(\mathbf{x})$, then the conditional simulation $Z_{cs}(\mathbf{x})$ is calculated as

$$Z_{cs}(\mathbf{x}) = Z_k(\mathbf{x}) + (Z_{ncs}(\mathbf{x}) - Z_{ncsk}(\mathbf{x})) \tag{1.24}$$

where $Z_{ncsk}(\mathbf{x})$ designates Kriging of $Z_{ncs}(\mathbf{x})$ using as data the values $Z_{ncs}(\mathbf{x}_i)$ of the non-conditional simulation at the conditioning data locations. Thus to the smooth term $Z_k(\mathbf{x})$ is added the rough term $(Z_{ncs}(\mathbf{x}) - Z_{ncsk}(\mathbf{x}))$. Chilès and Delfiner (2012, p. 495) show that $Z_{cs}(\mathbf{x})$ honors the data and has the same (generalized) covariance as $Z_{ncs}(\mathbf{x})$ (and hence as $Z(\mathbf{x})$).

Equation 1.24 can also be expressed in the form of a "rough plus smooth" equation

$$Z_{cs}(\mathbf{x}) = Z_{ncs}(\mathbf{x}) + (Z_k(\mathbf{x}) - Z_{ncsk}(\mathbf{x})) \tag{1.25}$$

Using Eq. 1.17, Eq. 1.25 can be written in the discrete case, assuming that the data are average values of the gridded values and are affected by a measurement error. At location u of the discrete grid

$$z_{ucs} = z_{uncs} + CF' \left( FCF' + C_{\varepsilon} \right)^{-1} (\mathbf{y} - Fz_{uncs}) \tag{1.26}$$

Equation 1.26 shows that conditional simulation is obtained by adding to a non-conditional simulation a Kriging of the mismatch $(\mathbf{y} - Fz_{uncs})$ between the data and the unconditional simulation at the data locations. This formalism will prove to be quite general and will facilitate the understanding of the relationship between conditional simulation and Kalman Filtering.
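A minimal sketch of the "rough plus smooth" construction of Eq. 1.26 is given below. It assumes a zero-mean field on a hypothetical 1-D grid, an exponential covariance, and exact point data (zero error covariance); the unconditional simulations are drawn through a Cholesky factor of the covariance, which is only one of several possible choices.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 60
x = np.arange(p, dtype=float)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 8.0)    # assumed covariance, zero mean
idx = np.array([4, 25, 50])                            # assumed data locations
F = np.zeros((idx.size, p))
F[np.arange(idx.size), idx] = 1.0
C_eps = np.zeros((idx.size, idx.size))                 # exact data in this sketch
y = np.array([0.7, -1.2, 0.3])

Lam = C @ F.T @ np.linalg.inv(F @ C @ F.T + C_eps)     # Kriging weights (Eq. 1.20)
L = np.linalg.cholesky(C)                              # C = L L'

def conditional_simulation():
    z_ncs = L @ rng.standard_normal(p)                 # non-conditional simulation
    return z_ncs + Lam @ (y - F @ z_ncs)               # add the Kriged mismatch (Eq. 1.26)

sims = np.array([conditional_simulation() for _ in range(2000)])
kriging = Lam @ y                                      # simple Kriging estimate
```

Every realization honors the three data exactly, and the average of the realizations approaches the Kriging estimate as their number grows.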

### 1.3.2 Method 2: Sequential Gaussian Simulation (SGS)

SGS (Deutsch and Journel 1998) is probably the most popular and flexible conditional simulation technique used in applications. SGS works under the multigaussian assumption and sequentially draws random locations within the simulated grid. At each new random location, the value is first Kriged from the previously simulated values and the well data. Then, a random value is sampled from the Gaussian pdf with mean equal to the Kriged value and variance equal to the Kriging variance (SGS uses the property that, in the multivariate normal case, univariate conditional distributions are also Gaussian). Then the sampled value is merged with the rest of the dataset, and a new random location is chosen within the simulated grid. The grid points where a data point is present are treated the same way as grid points with no data if the error ε affecting the data is different from zero. If all the data are exact, then the grid nodes with data points are left unchanged. The result is a Gaussian realization constrained by the data values and satisfying the input statistics (mean and covariance function).

The main difference between "rough plus smooth" and SGS is that SGS works sequentially, grid point by grid point. The sequential nature of SGS is well suited to the geostatistical inversion of seismic data. Indeed, at each grid node, the sequential approach can make sure that the sampled value is compatible with both the previously generated points and the seismic data at the same location, thus combining the advantage of single trace inversion with that of spatial coupling. This will be discussed in Sect. 1.4.
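The sequential logic of SGS can be sketched as follows for a zero-mean field with exact data on a 1-D grid. The function name `sgs_1d`, the covariance model and the data values are hypothetical; practical implementations restrict each Kriging system to a search neighborhood, whereas this sketch uses all previously simulated nodes for simplicity.

```python
import numpy as np

def sgs_1d(p, data_idx, data_val, cov, rng):
    """Minimal sequential Gaussian simulation on grid nodes 0..p-1 (zero mean, exact data)."""
    z = np.full(p, np.nan)
    z[data_idx] = data_val                       # exact data: these nodes are left unchanged
    known = list(data_idx)                       # data plus previously simulated nodes
    for u in rng.permutation(p):                 # random visiting path
        if not np.isnan(z[u]):
            continue
        xk = np.array(known, dtype=float)
        K = cov(xk[:, None] - xk[None, :])       # covariances between conditioning points
        k = cov(xk - u)                          # covariances with the target node
        lam = np.linalg.solve(K, k)              # simple Kriging weights
        mean = lam @ z[known]                    # simple Kriging estimate
        var = max(cov(0.0) - lam @ k, 0.0)       # simple Kriging variance
        z[u] = rng.normal(mean, np.sqrt(var))    # draw from the local conditional Gaussian
        known.append(u)                          # merge the new value with the dataset
    return z

cov = lambda h: np.exp(-np.abs(h) / 10.0)        # assumed exponential covariance
rng = np.random.default_rng(2)
z = sgs_1d(80, [10, 40, 70], [1.0, -1.0, 0.5], cov, rng)
```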

### 1.3.3 Spectrum and Conditional Simulation

Since the frequency spectrum is the Fourier transform of the covariance (Chilès and Delfiner 2012, p. 66), the spectrum of a conditional simulation is the same as that of the data. Conditional simulation addresses the following statement from Claerbout (2002) about seismic data interpolation: "Of all the assumptions we could make to fill empty bins, one that people usually find easiest to agree with is that the spectrum should be the same in the empty-bin regions as where bins are filled."

Claerbout (2002) also defines the Prediction Error Filter (PEF) as the linear operator T that transforms the data into a white noise. In other words, $T'T$ is the inverse of the covariance. Based on Eq. 1.11, this also means that T is the spline operator associated with the covariance of the data. Claerbout (2002) shows that unconditional simulations can be generated by applying $T^{-1}$ to a white noise. This is the same technique as that used by Oliver (1988) and Oliver (1995) who applies what he calls the square root of the covariance function to a white noise.
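In matrix form, the relationship between a whitening operator and the covariance can be illustrated with a Cholesky factorization. This is only one admissible choice of T, assumed here for convenience; it is not the convolutional filter estimated in practice, and the grid and covariance are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 200
x = np.arange(p, dtype=float)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 15.0)   # assumed covariance

L = np.linalg.cholesky(C)      # C = L L', one possible "square root" of the covariance
T = np.linalg.inv(L)           # whitening operator: T C T' = I, hence T'T = C^{-1}

w = rng.standard_normal(p)     # white noise
z = L @ w                      # unconditional simulation, i.e. T^{-1} applied to w
```

Applying T to the simulated vector z returns the white noise w, which is the defining property of the whitening operator.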

### 1.4 Geostatistical Inversion of Seismic Data

### 1.4.1 Deterministic Seismic Inversion

Until the mid-nineties or so, most seismic inversion studies were deterministic, in the sense that they generated a single "best" model, usually at the same resolution as the seismic data. Often, regularization-based or Bayesian methods were used, which led to the generation of one "maximum posterior" or "optimal for a given norm (often L2)" 3-D acoustic impedance model (Tarantola 2005).

If the seismic inversion problem is linearized as with Fatti et al.'s (1994) model, the reflection coefficient $r(\theta)$ at seismic time t for a seismic block of offset θ can be written

$$r(\theta) = a_1(\theta) \frac{\partial \log I_p(t)}{\partial t} + a_2(\theta) \frac{\partial \log I_s(t)}{\partial t} \tag{1.27a}$$

$$\text{and} \quad y(\theta) = w(\theta) * r(\theta) + \varepsilon(\theta) \tag{1.27b}$$

where $I_p(t)$ and $I_s(t)$ are the compressive and shear impedances at time t, $a_1(\theta)$ and $a_2(\theta)$ are offset-related parameters, $w(\theta)$ is the seismic wavelet for offset θ and $\varepsilon(\theta)$ is noise. This model is linear in the logarithm of $I_p(t)$ and $I_s(t)$. Thus, as long as the logarithms of impedances are inverted, the seismic amplitudes can be written as in Eq. 1.15 as a linear function of the logarithms of impedances, and the posterior mean obtained by multigaussian Bayesian seismic inversion (Eqs. 1.17 and 1.18) is identical to Kriging. The solution can also be regarded as a regularization-based solution, where the norm controlling the smoothness is derived from the inverse of the covariance.
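To make the link with Eq. 1.15 concrete, the sketch below assembles a matrix F for a single trace and a single offset, keeping only the $I_p$ term of Eq. 1.27a for brevity (the $I_s$ term would add a second block of the same form). The Ricker wavelet, the sampling interval and the coefficient `a1` are hypothetical choices, not values from the chapter.

```python
import numpy as np

def ricker(f, dt, half_len):
    """Ricker wavelet of peak frequency f (Hz), sampled every dt seconds."""
    t = np.arange(-half_len, half_len + 1) * dt
    a = (np.pi * f * t) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

def convolution_matrix(w, n):
    """n x n matrix W such that W @ r equals np.convolve(r, w, mode='same')."""
    W = np.zeros((n, n))
    h = len(w) // 2
    for i in range(n):
        for j in range(n):
            k = i - j + h
            if 0 <= k < len(w):
                W[i, j] = w[k]
    return W

nt, dt = 100, 0.004                              # samples per trace, sampling interval (s)
a1 = 0.5                                         # hypothetical offset-dependent coefficient
D = (np.eye(nt, k=1) - np.eye(nt))[:-1] / dt     # forward time difference, (nt-1) x nt
W = convolution_matrix(ricker(30.0, dt, 20), nt - 1)
F = a1 * W @ D                                   # linear operator acting on log impedance

rng = np.random.default_rng(4)
log_ip = np.log(6000.0) + 0.05 * np.cumsum(rng.standard_normal(nt))
y = F @ log_ip                                   # noise-free synthetic seismic trace
```

Once F is available in this form, the posterior mean and covariance of the log impedance follow directly from Eqs. 1.17 and 1.18.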

At the time when only deterministic inversion was used, geostatisticians often treated seismic data as "soft" information, making use only of statistical correlations between seismic and reservoir parameters in order to constrain the earth models. This "soft" approach to seismic data allowed the development of some interesting interpolation techniques such as external drift or collocated coKriging (Dubrule 2003). However it also led to reservoir models not fully compatible with the seismic data as, if a seismic forward model such as that of Eq. 1.27 was applied to them, the actual seismic data was not recovered.

The above approaches proved sufficient until the late eighties or so, as seismic data were used at rather large scale. Thanks to the development of 3-D earth modeling at the reservoir scale in the early nineties, it became necessary to work with models at higher resolution than seismic data, and hence to quantify the uncertainty attached to these models. Then the availability of 4D seismic data also called for new technology to better constrain the earth models. Geostatistical inversion, described below, was developed with these issues in mind.

### 1.4.2 Geostatistical Inversion (GI)

The original GI algorithm (Bortoli et al. 1992; Haas and Dubrule 1994) used SGS to simulate high-resolution acoustic impedance traces constrained by seismic data. SGS starts by picking a random cell within a regular two-dimensional grid. At this cell, a large number of possible acoustic impedance vertical traces are generated by SGS, then the trace that best matches the actual seismic trace at this location is selected. Then SGS moves to another random location of the two-dimensional grid, etc. until the whole model is filled with high-resolution impedance traces. Initially SGS appeared to be well suited to this application, as it allowed the use of any kind of forward model—linear or not—relating the acoustic impedance trace generated by SGS to the seismic amplitude trace at the same location. The acoustic impedance vertical traces simulated by SGS typically have higher frequency content than the seismic amplitudes, which makes them non-unique. This uncertainty can be quantified by generating multiple conditional simulations. Unfortunately the use of SGS proved to take too much computer time for large seismic datasets.

By revisiting the above GI algorithm in a Bayesian framework and in the linear context of Fatti's model (Eq. 1.27), authors such as Buland and Omre (2003) or Escobar et al. (2006) not only clarified the GI formalism but also provided a straightforward conditional simulation algorithm based on Eqs. 1.17 and 1.18 which was more efficient than SGS for sampling acoustic impedance traces compatible with seismic amplitudes. Whilst Bayesian inversion provided an expression of the posterior mean and covariance of the multigaussian pdf of the impedances, GI allowed the sampling of reservoir-scale impedance realizations from this pdf.

As convincingly shown by Francis (2006a, b) or Escobar et al. (2006), cut-off operations such as those used to translate acoustic impedance into facies can be applied to GI realizations, thus avoiding the statistical bias that would result if these cut-offs were applied to Kriging.

### 1.5 Kalman Filtering and Ensemble Kalman Filtering

### 1.5.1 Kalman Filtering (KF)

Suppose that a Gaussian random vector $Z_{t-1}$ has evolved until time (t − 1) and that $Z_{t-1}$ is an unbiased estimate of the unknown true state vector $z_{t-1}$ at time (t − 1)

$$Z_{t-1} = z_{t-1} + R_{t-1} \text{ with } E(R_{t-1}) = 0 \text{ and } \operatorname{Var}(Z_{t-1}) = \operatorname{Var}(R_{t-1}) = C_{t-1} \tag{1.28}$$

If the model error is neglected, the forward model relating the true state vector at time (t − 1) with the state vector at time t is assumed to be a linear function $L_t$

$$z_t = L_t z_{t-1} \tag{1.29}$$

At time step t, the unknown true state of the system has evolved according to Eq. 1.29 and a vector dt of n new data may also be available. Assume that these data are linear functions of the state vector zt, and can be expressed as in Eq. 1.15

$$d_t = F_t z_t + \varepsilon_t \tag{1.30}$$

where the error vector $\varepsilon_t$ has mean zero and covariance matrix $C_{\varepsilon_t}$.

KF (Kalman 1960) aims to combine the information provided about $z_t$ by the forward model $L_t$ applied to the estimate $Z_{t-1}$ (Eq. 1.29) with the information provided by the data $d_t$ (Eq. 1.30). Bayes' theorem can be used for this, with $L_t Z_{t-1}$ playing the role of the prior distribution. It is easy to verify that the covariance of the random vector $L_t Z_{t-1}$ is $C_t = L_t C_{t-1} L_t'$. Hence, under Gaussian assumptions the best estimate is (from Eq. 1.19)

$$Z_t = L_t Z_{t-1} + \Lambda_t (d_t - F_t L_t Z_{t-1}) \tag{1.31}$$

where the Kriging weights matrix $\Lambda_t$ (as in Eq. 1.20) is now called the Kalman gain

$$
\Lambda_t = C_t F_t' \left( F_t C_t F_t' + C_{\varepsilon_t} \right)^{-1} \tag{1.32}
$$

$Z_{t-1}$ as defined in Eq. 1.28 can represent any kind of unbiased estimate based on all the information available at time (t − 1). Kriging and conditional simulation are both unbiased estimates of $z_t$; only their variance differs, being minimum if $Z_{t-1}$ is Kriging and larger if $Z_{t-1}$ is a simulation. Chilès and Delfiner (2012, p. 497) show that the variance of the difference between a random function and its conditional simulation is twice the Kriging variance. In case $Z_{t-1}$ is a simulation, Eq. 1.31 looks like the "rough plus smooth" method (Eq. 1.26) with $L_t Z_{t-1}$ playing the role of the non-conditional simulation. Equation 1.31 makes the estimate $L_t Z_{t-1}$ conditional to the new data $d_t$ by adding an interpolation of the mismatch between $F_t L_t Z_{t-1}$ and the data.
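One KF step, as described by Eqs. 1.29 to 1.32, can be sketched as below. The function name `kalman_step` and the two-variable example (a position and a velocity, with only the position observed) are hypothetical; model error is neglected, as in the text, and the covariance update reuses the form of Eq. 1.18.

```python
import numpy as np

def kalman_step(Z_prev, C_prev, L_t, F_t, C_eps, d_t):
    """Propagate the estimate with L_t, then condition it to the new data d_t."""
    Z_pred = L_t @ Z_prev                                  # forecast (Eq. 1.29)
    C_t = L_t @ C_prev @ L_t.T                             # covariance of the forecast
    gain = C_t @ F_t.T @ np.linalg.inv(F_t @ C_t @ F_t.T + C_eps)   # Kalman gain (Eq. 1.32)
    Z_t = Z_pred + gain @ (d_t - F_t @ Z_pred)             # update (Eq. 1.31)
    C_new = C_t - gain @ F_t @ C_t                         # updated covariance (as in Eq. 1.18)
    return Z_t, C_new

# Hypothetical example: state = (position, velocity), unit time step, position observed.
L_t = np.array([[1.0, 1.0], [0.0, 1.0]])
F_t = np.array([[1.0, 0.0]])
Z_prev = np.array([0.0, 1.0])
C_prev = np.eye(2)
C_eps = np.array([[0.5]])
d_t = np.array([1.5])

Z_t, C_new = kalman_step(Z_prev, C_prev, L_t, F_t, C_eps, d_t)
print(Z_t)      # [1.4 1.2]
```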

In standard geostatistical applications, the observations are often spatial and hence assimilated simultaneously, while KF processes information sequentially, time step after time step. Tarantola (2005) (in Appendix 6.18) shows that, if in a linear least-squares problem the dataset can be divided into subsets with zero covariance between them, then solving one global inverse problem is equivalent to solving a series of smaller problems using the posterior state and covariance matrix of each partial problem as prior information for the next. Oliver et al. (2008) also show (in Chap. 11) that, under the same assumptions as Tarantola (2005), the step by step computation of KF provides (in the multigaussian case) the same result as would be obtained by integrating all the data in one single step. In the case where Lt is the identity function, these two results also imply that simple Kriging would provide the same result if data were incorporated sequentially into the Kriging system, or in one single batch (under the assumptions that each batch of data has zero covariance with the others).
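The equivalence between batch and sequential assimilation can be verified numerically in the case where $L_t$ is the identity. The sketch below, under assumed toy values, conditions a prior to six data in one step, then in two successive batches whose errors are uncorrelated, and compares the results.

```python
import numpy as np

def update(m, C, F, C_eps, y):
    """Posterior mean and covariance (Eqs. 1.17 and 1.18)."""
    Lam = C @ F.T @ np.linalg.inv(F @ C @ F.T + C_eps)
    return m + Lam @ (y - F @ m), C - Lam @ F @ C

rng = np.random.default_rng(5)
p = 40
x = np.arange(p, dtype=float)
C0 = np.exp(-np.abs(x[:, None] - x[None, :]) / 6.0)   # assumed prior covariance
m0 = np.zeros(p)
idx = np.array([2, 9, 17, 23, 31, 38])                 # assumed data locations
F = np.zeros((idx.size, p))
F[np.arange(idx.size), idx] = 1.0
C_eps = 0.1 * np.eye(idx.size)                         # errors uncorrelated between batches
y = rng.standard_normal(idx.size)

# All data in one single step.
m_batch, C_batch = update(m0, C0, F, C_eps, y)

# Two successive batches: the posterior of the first becomes the prior of the second.
m_seq, C_seq = m0, C0
for rows in (slice(0, 2), slice(2, 6)):
    m_seq, C_seq = update(m_seq, C_seq, F[rows], C_eps[rows, rows], y[rows])
```

The equality would no longer hold if the errors of the two batches were correlated, which is the assumption stated in the text.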

### 1.5.2 Constraining Reservoir Models by Production Data

Fluid flow models are strongly non-linear, and linear approximations such as those already discussed for seismic modeling or KF cannot be used.

A distinction must be made between "history-matching", where a single reservoir model is modified until the flow simulation matches the production data, and "constraining reservoir models by production data", where reservoir model realizations compatible with production data are generated. Here the discussion focuses on the second objective rather than the first one. Some techniques to address this objective are based on rigorous approaches such as Markov-Chain Monte Carlo (MCMC) or Genetic Algorithms (GA) (Oliver et al. 2008). But these are very time-consuming and often impractical. Ensemble Kalman filtering appears to be a more practical approach for incorporating production data into the reservoir model.

### 1.5.3 Ensemble Kalman Filtering (EnKF) Versus Conditional Simulation

EnKF (Evensen 2007; Oliver et al. 2008) starts with an ensemble of initial realizations that are not constrained by production data. Typically the state vector zt at time t contains permeabilities, porosities, saturations, pressures, and thermodynamic variables at the simulator grid nodes followed by a vector of predicted production data at each well i at time t.

The following notation is used for a given state vector zt

$$z_t = \left(z_{ut}^1, z_{ut}^2, \dots, z_{ut}^k, q_{it}^{1*}, \dots, q_{it}^{l*}\right) \tag{1.33}$$

It is assumed that there are k gridded variables in $z_t$, that the simulator grid is composed of p cells u, and that there are n wells i, each with l new production data at time t. The total size of the state vector $z_t$ is kp + nl. The predicted data vector $d_t^*$ is the vector of size nl

$$d_t^* = \left(q_{1t}^{1*}, \dots, q_{nt}^{1*}, q_{1t}^{2*}, \dots, q_{nt}^{2*}, \dots, q_{1t}^{l*}, \dots, q_{nt}^{l*}\right) \tag{1.34}$$

The relation between state vector and predicted data is

$$d_t^* = P z_t \text{ with } P = \left(0_{nl \times kp}, I_{nl \times nl}\right) \tag{1.35}$$

P is an nl × (kp + nl) matrix. The function $f_t$, which represents the flow simulator, is non-linear. If the model errors are neglected

$$z_t = f_t(z_{t-1}) \tag{1.36}$$

The function $f_t$ does not modify the rock properties (unless they are affected by changes in pressure and saturation), but replaces the pressure, saturation, and simulated data with new values at time t.

The problem is now to calculate the best estimate of the state vector $z_t$ combining the information provided by the flow simulation forward model $f_t(z_{t-1})$ and that provided by the new data $d_t$.

If ft is a linear function, this is the standard KF domain of application and Eq. 1.31 applies, Lt playing the role of ft. But now ft is non-linear. It would still be convenient to update the state vector through a generalization of Eq. 1.31

$$z_t = f_t(z_{t-1}) + \Lambda_t(d_t - P f_t(z_{t-1})) \tag{1.37}$$

where the Kalman gain $\Lambda_t$ is obtained using Eq. 1.32. Assuming that there is no error associated with the data, Eq. 1.32 can be simplified into

$$
\Lambda_t = C_t P' \left( P C_t P' \right)^{-1} \tag{1.38}
$$

Equation 1.38 requires the knowledge of the covariance $C_t$ of $f_t(z_{t-1})$, in other words the covariance of the image of the state vector after application of the flow simulation model $f_t$. Since $f_t$ is non-linear, this covariance cannot be simply calculated, as in the linear case, from the covariance at the previous step. EnKF addresses this issue by statistically deriving this covariance using the information from the multiple realizations, typically about a hundred of them. This is the key idea behind EnKF.

There are of course a number of issues resulting from the fact that the covariances are calculated from a finite number of realizations of the ensemble. The first one is spurious correlation, because the ensemble members are not independent except in the starting ensemble. The second one is that if the number of realizations in the ensemble is not large enough, then the covariances are poorly estimated. Standard geostatistics addresses this by fitting mathematical models to the experimental covariances, in order to smooth the spurious correlations.
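The EnKF update can be sketched with a toy example in which the covariance is estimated from the ensemble itself. Everything below is an assumption made for illustration: the "flow simulator" is a simple non-linear function of two grid values, there is a single predicted datum, and a small data error is retained so that the general gain of Eq. 1.32 is used. The perturbation of the observations for each member is a common EnKF variant and is not discussed in the text.

```python
import numpy as np

def enkf_update(ensemble, P, d_obs, C_eps, rng):
    """One EnKF update; each row of `ensemble` is a forecast state f_t(z_{t-1})."""
    N = ensemble.shape[0]
    A = ensemble - ensemble.mean(axis=0)                 # deviations from the ensemble mean
    C_t = A.T @ A / (N - 1)                              # empirical covariance of the forecasts
    gain = C_t @ P.T @ np.linalg.inv(P @ C_t @ P.T + C_eps)
    # Assumed variant: each member is matched to a noisy copy of the observed data.
    d_pert = d_obs + rng.multivariate_normal(np.zeros(len(d_obs)), C_eps, size=N)
    return ensemble + (d_pert - ensemble @ P.T) @ gain.T  # Eq. 1.37, member by member

rng = np.random.default_rng(6)
N, p = 200, 5
theta = rng.standard_normal((N, p))                      # prior realizations of a gridded property
q_pred = np.exp(0.5 * (theta[:, 0] + theta[:, 1]))       # hypothetical non-linear simulator output
forecast = np.column_stack([theta, q_pred])              # state: gridded values, then predicted data
P = np.hstack([np.zeros((1, p)), np.eye(1)])             # Eq. 1.35 with nl = 1
d_obs = np.array([2.0])
C_eps = np.array([[0.01]])

updated = enkf_update(forecast, P, d_obs, C_eps, rng)
```

After the update, the predicted-data component of the ensemble is pulled towards the observed value, and the gridded values are corrected through their empirical covariance with that component.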

### 1.5.4 Ensemble Kalman Filtering and Its Relationship with CoKriging

In Eq. 1.37, focus now on the rock properties in the state vector. $f_t(z_{t-1})$ leaves the rock properties unchanged, as only the time-dependent state vectors in the simulator grid are calculated by one time-step of the flow simulator, whilst $\Lambda_t(d_t - P f_t(z_{t-1}))$ is a linear combination of the differences between observed and predicted production data at each well. Thus EnKF interpolates between the wells by calculating a linear combination of these differences across the field, then adds these interpolated differences to the rock properties model. Is it possible to reformulate EnKF as a well by well geostatistical approach?

The term $\Lambda_t$ in Eq. 1.37 is the Kalman gain as given by Eq. 1.38. In the case where there is no error affecting the data, Eqs. 1.37 and 1.38 can be written

$$z_t - f_t(z_{t-1}) = C_t P' \left( P C_t P' \right)^{-1} (d_t - P f_t(z_{t-1})) \tag{1.39}$$

The left-hand side is the update calculated by EnKF for the property of interest as the time step evolves from (t − 1) to t. The Kalman gain coefficients of the right-hand side are nothing other than the simple coKriging weights (see for instance Chilès and Delfiner 2012, p. 303).

Thus, each estimate of a 3-D spatial parameter such as porosity or permeability at time (t − 1) is updated at time t by a linear combination of all the inconsistencies generated by this parameter at the data points. Since, in the case of flow simulations, many parameters are involved in the production profiles prediction, all the individual parameters' 3-D models must be corrected in a consistent way, which is why multivariate coKriging, and not univariate Kriging, applies here.

### 1.6 Beyond the Formal Relationship Between Geostatistics and Bayes

### 1.6.1 Two Identical Formalisms but Different Assumptions

The above developments show that techniques such as conditional simulation, Bayesian inversion, geostatistical inversion and ensemble Kalman filtering follow a similar mathematical formalism.

However, their philosophy of application differs in the way the covariance is approached. This can be understood by looking again at Bayes' rule as presented in Eq. 1.16

$$f_{\rm post}(z) \propto f_{\rm prio}(z)\, g(\mathbf{y} \mid z) \tag{1.40}$$

With geostatistics, the experimental (generalized) covariance calculated on the data y is fitted by a model which becomes the covariance of the unconditional distribution $f_{\rm prio}(z)$. Then the data y are used a second time through the simulation conditioning process of Eq. 1.26.

With Bayes, the covariance model associated with $f_{\rm prio}(z)$ is a prior based on local or analog knowledge, but not on the data themselves (Tarantola 2005). This prior is transformed into a posterior covariance through the conditioning process of Eq. 1.40.

With geostatistics, the aim of conditional simulation is to generate realizations that match the data and satisfy the input covariance; the SGS and rough plus smooth algorithms work only if the data themselves satisfy this input covariance. But the random function $Z_{cs}(\mathbf{x})$ of Eqs. 1.24 and 1.25 is not an ergodic or even a stationary random function; its variance at each location x is equal to the Kriging variance and changes with x, as it is zero at the data points. In other words, the covariance of the random function $Z(\mathbf{x})$ is different from that of $Z_{cs}(\mathbf{x})$ conditionally to the data (Chilès and Delfiner 2012, p. 497). But the covariance calculated on a single conditional realization does not "see" any difference between the grid cells associated with data points and those not associated with data points. It is only as the realizations change, leaving the data unchanged, that the covariance across realizations appears non-stationary and hence non-ergodic.

On the other hand, Bayes combines a prior covariance (usually different from that of the data) with a data-based likelihood, resulting in a posterior pdf that sits somewhere between the prior and the likelihood. Bayes updates prior covariances based on new data whilst conditional simulation anchors the realizations against the hard data (Escobar, personal communication).

### 1.6.2 Model Falsifiability

Tarantola (2006) challenges the geostatistical and Bayes formalisms if models are to be falsifiable or have a scientific meaning: "I suggest that the setting, in principle, for an inverse problem should be as follows: use all available prior information to sequentially create models of the system, potentially an infinite number of them. For each model, solve the forward modeling problem, compare the predictions to the actual observations and use some criterion to decide if the fit is acceptable or unacceptable, given the uncertainties in the observations and, perhaps, in the physical theory being used. The unacceptable models have been falsified, and must be dropped. The collection of all the models that have not been falsified represent the solution of the inverse problem." Tarantola (2006) thus proposes to keep all the prior realizations that are compatible with the data: the data are used to validate or reject the prior realizations, rather than to update the prior pdf into the posterior.
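The falsification procedure described above can be sketched as a simple accept or reject loop. The prior (a zero-mean Gaussian field with exponential covariance), the forward model, the observation uncertainty `sigma` and the acceptance threshold are all hypothetical; in particular, the acceptance criterion is a choice left to the user.

```python
import numpy as np

rng = np.random.default_rng(7)
p = 30
x = np.arange(p, dtype=float)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 8.0)   # assumed prior covariance
L = np.linalg.cholesky(C)
idx = np.array([5, 20])                               # assumed observation locations
y = np.array([0.5, -0.3])                             # observations
sigma = 0.3                                           # assumed observation uncertainty

def forward(z):
    """Hypothetical forward model: read the model at the observation locations."""
    return z[idx]

kept = []
for _ in range(5000):
    z = L @ rng.standard_normal(p)                    # a model created from prior information
    misfit = np.abs(forward(z) - y) / sigma           # compare predictions with observations
    if np.all(misfit < 2.0):                          # acceptance criterion (a user choice)
        kept.append(z)                                # model not falsified: keep it
kept = np.array(kept)
print(len(kept), "models out of 5000 were not falsified")
```

The collection `kept` plays the role of the solution of the inverse problem in this setting; its size depends strongly on the acceptance criterion and on the number of observations.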

### 1.6.3 Looking Ahead: Machine Learning and Falsifiability

The fast growth in machine learning algorithms (Goodfellow et al. 2016) is challenging the geostatistical and Bayesian formalisms in situations where data are plentiful. Thanks to this large number of data, the approach used to falsify a convolutional neural network model (for instance) relating input parameters to data is often to test whether the convolutional model works as well on a training (or calibration) dataset as on a test dataset not used for training. The prior model itself is completely data-driven, which contradicts Tarantola (2006), but the validation step is along the lines of his above recommendations! This topic is likely to generate interesting discussions in the future.

### 1.7 Conclusion

The objective of this chapter was to discuss the convergence observed over the last fifty years between geostatistics and other modelling and inversion techniques.

A formal convergence exists between the main techniques used to constrain reservoir models by multi-disciplinary data. Kriging, splines, conditional simulation, geostatistical inversion and ensemble Kalman filtering can be interpreted using either the geostatistical formalism or Bayes.

Most of these techniques amount to the same approach where an initial model is updated by using a linear combination of the mismatches between the new data and their prediction from the initial model (Eqs. 1.19, 1.26, 1.31 and 1.39).

However, the methods above have a different philosophy towards the inference of the covariances used in these calculations. Bayes uses the data to update a prior pdf which is independent of the data. Geostatistics generates realizations of conditional simulations that reproduce the modeled covariance, or the spectrum, of the data. EnKF does not model a covariance but directly uses the empirical covariances derived from the ensemble realizations and their flow simulations.

Acknowledgements The author would like to thank Imperial College London, and Total for seconding him there as a Visiting Professor. Igor Escobar and Danila Kuznetsov are thanked for the fruitful discussions held at the Total Geoscience Research Centre in Aberdeen (UK). The author will also never forget the passionate and illuminating discussions with Albert Tarantola at the Café Beaubourg in Paris!

### References


### **Chapter 2 A Statistical Commentary on Mineral Prospectivity Analysis**

**Adrian Baddeley**

**Abstract** We compare and contrast several statistical methods for predicting the occurrence of mineral deposits on a regional scale. Methods include logistic regression, Poisson point process modelling, maximum entropy, monotone regression, nonparametric curve estimation, recursive partitioning, and ROC (Receiver Operating Characteristic) curves. We discuss the use and interpretation of these methods, the relationships between them, their strengths and weaknesses from a statistical standpoint, and fallacies about them. Potential improvements and extensions include models with a flexible functional form; techniques which take account of sampling effort, deposit endowment and spatial association between deposits; conditional simulation and prediction; and diagnostics for validating the analysis.

### **2.1 Introduction**

The pioneering work of Agterberg (1974) developed a statistical strategy for predicting the likely occurrence of mineral deposits. In essence, the observed association between known deposits and other known geostructural or geochemical information is used to predict the spatially-varying abundance of unknown deposits. The association between predictors and deposits is modelled by logistic regression.

This general approach to prospectivity analysis has been extended and adopted across a wide range of applications, for predicting mineral deposits (Chung and Agterberg 1980; Bonham-Carter 1995), archaeological finds (Scholtz 1981; Kvamme 1983), landslides (Chung and Fabbri 1999; Gorsevski et al. 2006), animal and plant species (Franklin 2009) and other features which can be treated as points at the scale of interest. Extensions and modifications include logistic regression for sampled data, maximum entropy, and weights-of-evidence modelling.

However, the scientific literature contains many conflicting statements about the interpretation of these methods. For example, there are different understandings of the fundamental scope and validity of logistic regression, about the degree of flexibility inherent in the assumptions, and about the interpretation of the results. This is a concern, because misunderstanding of a statistical technique poses the obvious risk that it may be mis-applied, its results misinterpreted, or its performance incorrectly evaluated.

A. Baddeley (✉)

Department of Mathematics and Statistics, Curtin University, GPO Box U1987, Perth, WA 6845, Australia e-mail: adrian.baddeley@curtin.edu.au

© The Author(s) 2018

B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6_2

In statistical science the understanding of these techniques has also changed dramatically over the last four decades. The modern synthesis of statistical modelling permits a new and deeper appreciation of prospectivity methods. New tools from statistical science may enable exploration geologists to perform a more searching analysis of their survey data.

Accordingly, this article offers a commentary and critique of prospectivity analysis from the standpoint of modern statistical methodology. We begin by examining the fundamentals of logistic regression, explaining the interpretation of the results, and discussing its strengths and weaknesses. We explain the close relationship between logistic regression, point process modelling, and maximum entropy methods. We canvass some alternative methods which are less well known, including monotone regression, nonparametric regression, recursive partitioning models, and ROC curves. (The popular weights-of-evidence method is not discussed here, but will be treated in detail in another article.) New tools include robust estimation, model selection and variable selection, conditional prediction and model diagnostics. Several unanswered questions in prospectivity analysis are identified as topics for future research in statistical methodology.

### **2.2 Example Data**

For the sake of demonstration and discussion, we shall use a vastly oversimplified example. The Murchison geological survey data shown in Fig. 2.1 record the spatial locations of gold deposits and associated geological features in the Murchison area of Western Australia. They are extracted from a regional survey (scale 1:500,000) of the Murchison area made by the Geological Survey of Western Australia (Watkins and Hickman 1990). The features shown in the Figure are the known locations of gold deposits, the known or inferred locations of geological faults, and greenstone outcrop. The study region is contained in a 330 × 400 km rectangle. At this scale, gold deposits are point-like, i.e. their spatial extent is negligible. These data were previously analysed in Foxall and Baddeley (2002), Brown et al. (2002); see also Groves et al. (2000), Knox-Robinson and Groves (1997). Data were kindly provided by Dr. Carl Knox-Robinson, and permission granted by Dr. Tim Griffin, Geological Survey of Western Australia and by Dr. Knox-Robinson.

Evidently, both the geological fault pattern and the greenstone outcrop have "predictive" value for gold prospectivity, because gold deposits are strongly associated with proximity to both features. For the purposes of analysis in this article, we require predictors to be spatial variables. A predictor *Z* should be a function *Z*(*u*) defined at any spatial location *u*. For a map of rock type such as the greenstone outcrop, the simplest choice for the predictor value *Z*(*u*) at location *u* is the "indicator" equal to 1 if the location *u* falls inside the greenstone, and 0 if it falls outside. For a map of linear features such as geological faults, a common choice for the predictor value *Z*(*u*) is the distance from *u* to the nearest fault. Figure 2.2 shows contours of this distance function for the Murchison data.

It is important to note that our choice of spatial predictor *Z*(*u*) will affect the results of the analysis: the results would usually be different if we replace the distance function in Fig. 2.2 by the squared distance or the square root of distance, etc. Several other choices of spatial predictor derived from geological fault data are canvassed in Berman and Turner (1992). Likewise for the greenstone outcrop we could have chosen another predictor, such as the distance function of the greenstone. The choice of predictor can be revisited after the analysis, as discussed in Sect. 2.4.6.

**Fig. 2.2** Contours of distance to the nearest fault in the Murchison survey

### **2.3 Logistic Regression**

Here we recapitulate and re-examine some details of the logistic regression technique, for the purposes of discussion.

### *2.3.1 Basics of Logistic Regression*

Logistic regression is a general statistical technique for modelling the relationship between a binary response variable and a numerical explanatory variable (Berkson 1955; McCullagh and Nelder 1989; Dobson and Barnett 2008; Hosmer and Lemeshow 2000). The use of logistic regression to predict the presence/absence of point events was pioneered in geology by Agterberg (1974, 1980), apparently on the suggestion of the statistician Tukey (1972): see Agterberg (2001). The study region is divided into pixels; in each pixel the presence or absence of any deposits is recorded; then logistic regression is used to predict the probability of the presence of a deposit as a function of predictor variables. This was later independently rediscovered in archaeology (Scholtz 1981; Hasenstab 1983; Kvamme 1983, 1995) and is now a standard technique in GIS applications (Bonham-Carter 1995) including spatial ecology (Franklin 2009).

The study region is divided into pixels of equal area. For each pixel, we record whether mineral deposits are present or absent. We then fit the *logistic regression* relationship

$$
\log \frac{p}{1-p} = \alpha + \beta z \tag{2.1}
$$

where *p* is the probability of presence of a deposit (or deposits) in a given pixel, and *z* is the corresponding value of the predictor variable.

Here *α* and *β* are model parameters which are estimated from the data. Some writers state that the interpretation of *α* and *β* is "obscure" (Wheatley and Gillings 2002, p. 175), perhaps because of the unfamiliar form of the left hand side of (2.1). The quantity *p*∕(1 − *p*) is the *odds* of presence against absence, that is, the probability *p* of presence of a deposit, divided by the probability 1 − *p* of absence. The left hand side of (2.1) is the logarithm of the odds of presence. (In this paper 'log' always refers to the natural logarithm, with base *e*.) The logistic regression relationship (2.1) states that the *log odds* of presence is a linear function of the predictor *z*. The straight line has slope *β* and intercept *α*. The transformation log(*p*∕(1 − *p*)) ensures that

$$p = \frac{e^{\alpha + \beta z}}{1 + e^{\alpha + \beta z}}\tag{2.2}$$

is a well-defined probability value (between 0 and 1) for any possible values of *α*, *β* and *z*. The log odds is the "canonical" choice of transformation in order to satisfy some desirable statistical properties (McCullagh and Nelder 1989), and arises naturally in many applications. Bookmakers often quote gambling odds that are equally spaced on a logarithmic scale, such as the sequence 2:1, 4:1, 8:1, 16:1. Since logistic regression is widely used in medical and public health research, standard statistical textbooks contain many useful ways to interpret and explain these quantities (Hosmer and Lemeshow 2000).
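As a minimal sketch of how (2.1) and (2.2) invert each other, the following Python fragment converts between the log odds and the presence probability. The function names and the parameter values are hypothetical, chosen only for illustration; they are not estimates from the Murchison data.

```python
import math

def log_odds(p):
    """Left hand side of (2.1): natural logarithm of the odds p / (1 - p)."""
    return math.log(p / (1.0 - p))

def presence_probability(alpha, beta, z):
    """Right hand side of (2.2), arranged so that exp() never overflows."""
    eta = alpha + beta * z          # linear predictor on the log-odds scale
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

# Hypothetical intercept and slope (illustration only).
alpha, beta = -4.0, -0.25
for z in (0.0, 1.0, 2.0, 5.0):
    p = presence_probability(alpha, beta, z)
    # The recovered log odds should equal alpha + beta * z.
    print(f"z = {z:4.1f}   p = {p:.5f}   log odds = {log_odds(p):.3f}")
```

Whatever values are supplied, the returned probability lies between 0 and 1, which is the property claimed for the transformation above.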

Once the parameters *α*, *β* have been estimated from data (as detailed in Sect. 2.3.3), the predicted probabilities *p<sub>j</sub>* can be computed using (2.2) and displayed as colours or greyscales in a pixel image, as shown in Fig. 2.3. Qualitative interpretation of the map seems to be adequate for many purposes, while many writers recommend using only the *sign* of the slope parameter *β* (Gorsevski et al. 2006, pp. 405–407). However, much more can be done with the fitted logistic regression, as we discuss below.

The general appearance of Fig. 2.3 is very similar to the contour plot of distance to nearest fault in Fig. 2.2. This is a foreseeable consequence of the simple model (2.1) which implies that contours of probability are contours of distance to nearest fault. This is not true of more complicated models involving several predictors.

**Fig. 2.3** Fitted probability of a gold deposit in each 10-km-square pixel in the Murchison survey, estimated by logistic regression. Pixel values are probabilities (between 0 and 1)

### *2.3.2 Flexibility and Validity*

Some writers describe logistic regression as a 'nonparametric' technique (Kvamme 2006, p. 24), which would suggest that it is able to detect and respond to any kind of relationship, not specified in advance, between the predictor *z* and the presence probability *p*. On the contrary, logistic regression is a parametric model of a very simple kind. The relationship between *z* and *p* is rigidly defined by Eqs. (2.1) and (2.2): *the relationship is linear* on the scale of the log odds. The position of the line is determined by the two parameters *α* and *β*. Logistic regression could be *false* for a particular application: that is, the model assumptions could be incorrect.

Logistic regression is an example of a "generalised linear model" (McCullagh and Nelder 1989; Dobson and Barnett 2008), essentially a linear regression of the transformed probabilities against the predictor. In the analysis of the Murchison data shown here, if we replace the distance function *Z*(*u*) by its square *Z*(*u*)<sup>2</sup>, or square root √*Z*(*u*), etc. in the logistic regression, we obtain a different model, which is incompatible with the original model. If the log odds are a linear function of squared distance, then they are *not* a linear function of distance. Consequently, the choice of predictor variable is very important, and it involves an implicit *model assumption* about the relationship between presence probability *p* and predictor *z*. Even the sign of the fitted slope parameter could be misleading if the predictor was chosen incorrectly.

Such freedom as does exist in the logistic regression model is the freedom to choose the predictor or predictors *Z*(*u*). Once the predictor is chosen, the model becomes rigid. If there is concern about the form of relationship between *p* and *Z*, one simple strategy is to fit a polynomial, instead of linear, relationship between the log odds and the predictor variable.

Statistical science has developed an armory of techniques for "validating" a regression analysis (Harrell 2001; Hosmer and Lemeshow 2000). These include diagnostics for checking the validity of the logistic regression relationship (2.1), measures of sensitivity of the fitted model to the data, techniques for selecting the most important variables and the most informative models, and measures of goodness-of-fit. As far as the author is aware, these techniques are rarely used in geoscience. This presents the risk of failing to detect situations where logistic regression analysis is not appropriate. Model validation is a kind of "due diligence" for data analysts.

A weakness of all parametric modelling is that, because of its "low degrees of freedom", the model predictions at a given location are heavily influenced by the entire dataset, including data observed under very different conditions. In the Murchison example, the predicted probability of presence of a gold deposit declines dramatically between distances 0, 1 and 2 km from the nearest fault. This is not necessarily a reflection of the observed frequency of occurrence of gold deposits at these distances: rather, it is a consequence of the large negative value of the estimated slope parameter *β*, which arises because of the scarcity of gold deposits at much larger distances.

Extension of the logistic regression technique to account for characteristics of the mineral deposits, such as total endowment of gold, would be problematic because it would effectively require a model for the probability distribution of the endowment (and this might also be spatially-varying). However, it is straightforward to apply logistic regression to different subsets of the deposits, for example to predict the occurrence of deposits with endowment exceeding a specified threshold.

The logistic regression technique described here assumes that the relationship (2.1) holds throughout the study region, with the same parameter values *α*, *β* throughout. This assumption can be avoided using geographically-weighted logistic regression (Lloyd 2011) or local likelihood estimation (Loader 1999; Baddeley 2017) which allow the parameters to be spatially-varying.

### *2.3.3 Fitting Procedure and Implicit Assumptions*

For the discussion it will be important to know a few details about the procedure that is used to fit the logistic regression relationship.

Suppose there are *N* pixels, with covariate values *z*<sub>1</sub>, …, *z<sub>N</sub>* respectively, and pixel presence/absence indicators *y*<sub>1</sub>, …, *y<sub>N</sub>* respectively, where *y<sub>j</sub>* = 1 if the *j*-th pixel contains a mineral deposit, and *y<sub>j</sub>* = 0 if not. The goal is to fit a relationship of the form (2.2). This is not a simple matter of curve-fitting, because the data (*z<sub>j</sub>*, *y<sub>j</sub>*) do not lie "along" or "near" the curve in any sense. See Fig. 2.4. Instead, it is necessary to specify a measure of closeness or agreement between the curve and the observed data: the model is fitted by choosing the parameter values *α*, *β* which make this agreement as close as possible.

The classical fitting method is *maximum likelihood*. Given the data *y*<sub>1</sub>, …, *y<sub>N</sub>* and *z*<sub>1</sub>, …, *z<sub>N</sub>*, define the *likelihood L*(*α*, *β*) to be the theoretical probability of obtaining the observed pattern of outcomes (*y*<sub>1</sub>, …, *y<sub>N</sub>*), as a function of the unknown parameter values *α* and *β*. The likelihood is a measure of agreement between the logistic regression curve and the observed data.

To find the likelihood, first consider a single pixel *j* where *j* = 1, 2, …, *N*. The probability of obtaining a presence (*y<sub>j</sub>* = 1) in this pixel is

$$p_j = \frac{e^{\alpha + \beta z_j}}{1 + e^{\alpha + \beta z_j}} \tag{2.3}$$

and the probability of an absence (*y<sub>j</sub>* = 0) is 1 − *p<sub>j</sub>*. The likelihood for pixel *j* is the probability of obtaining the *observed* outcome *y<sub>j</sub>*,

$$L_j = \mathbb{P}\{Y_j = y_j\} = \begin{cases} p_j & \text{if } y_j = 1 \\ 1 - p_j & \text{if } y_j = 0 \end{cases}$$

or more compactly

$$L_j = p_j^{y_j} (1 - p_j)^{1 - y_j} = \left(\frac{p_j}{1 - p_j}\right)^{y_j} (1 - p_j)$$

which is a function *L<sub>j</sub>* = *L<sub>j</sub>*(*α*, *β*) of the unknown values of the parameters. Then the full likelihood is the predicted probability of the entire observed pattern of presences and absences (*y*<sub>1</sub>, …, *y<sub>N</sub>*),

$$L = L_1 L_2 \cdots L_N, \tag{2.4}$$

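and is a function *L* = *L*(*α*, *β*) of the unknown parameter values *α*, *β*. Equation (2.4) assumes that the outcomes in different pixels are statistically independent of each other, because the likelihood is obtained by multiplying likelihood contributions from each pixel. That is, the logistic regression technique, as it is commonly applied to presence/absence data, makes two assumptions:

1. the presence probability in each pixel is related to the predictor through the logistic regression relationship (2.1);
2. the outcomes in different pixels are statistically independent of each other.
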
The (parametric) *maximum likelihood* fitting rule is to choose the values of the parameters *α*, *β* which maximise the likelihood *L*(*α*, *β*). This is a standard procedure in classical statistics, carrying with it many useful additional tools such as standard errors, confidence intervals, and significance tests (Hogg and Craig 1970; Freedman et al. 2007).

Ignoring some pathological cases (e.g. where no deposits are observed), the likelihood is maximised by setting its partial derivatives to zero. Equivalently we may work with the derivatives of log *L*. This yields the *score equations* for logistic regression

$$\sum_{j=1}^{N} p_j = \sum_{j=1}^{N} y_j \tag{2.5}$$

$$\sum_{j=1}^{N} p_j z_j = \sum_{j=1}^{N} y_j z_j \tag{2.6}$$

obtained by setting ∂ log *L*∕∂*α* = 0 and ∂ log *L*∕∂*β* = 0 respectively. Typically the score equations have a unique solution in (*α*, *β*), giving the maximum likelihood estimates *α̂*, *β̂* of the parameters. There are no explicit formulae for *α̂*, *β̂* and the score equations must be solved numerically.

The score equations (2.5)–(2.6) have a commonsense interpretation in their own right. In (2.5) the right hand side is the observed number of deposits, while the left hand side is the expected (mean) number of deposits according to the model. In (2.6) the right hand side is the sum of the predictor values at the observed deposits, while the left hand side is the expected (mean) value of this sum according to the model. In this case maximum likelihood is equivalent to the "method of moments" in which parameters are estimated by equating the observed value of a statistic to its theoretical mean value.
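The numerical solution of the score equations can be sketched in a few lines. The fragment below is an illustration under stated assumptions, not the procedure used for the Murchison analysis: it applies plain Newton-Raphson iteration to synthetic pixel data generated from hypothetical parameter values (−1 and −0.5), then checks that the fitted probabilities satisfy (2.5) and (2.6). All function and variable names are invented for the example.

```python
import math
import random

def sigmoid(eta):
    """Inverse of the log-odds transformation, written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    e = math.exp(eta)
    return e / (1.0 + e)

def fit_logistic(z, y, n_iter=50, tol=1e-10):
    """Maximise the likelihood (2.4) for one predictor by Newton-Raphson."""
    alpha, beta = 0.0, 0.0
    for _ in range(n_iter):
        p = [sigmoid(alpha + beta * zj) for zj in z]
        # Score vector: derivatives of log L with respect to alpha and beta.
        s0 = sum(yj - pj for yj, pj in zip(y, p))
        s1 = sum((yj - pj) * zj for yj, pj, zj in zip(y, p, z))
        # 2 x 2 information matrix (negative second derivatives of log L).
        w = [pj * (1.0 - pj) for pj in p]
        i00 = sum(w)
        i01 = sum(wj * zj for wj, zj in zip(w, z))
        i11 = sum(wj * zj * zj for wj, zj in zip(w, z))
        det = i00 * i11 - i01 * i01
        # Newton step = (information matrix)^(-1) * score.
        d_alpha = (i11 * s0 - i01 * s1) / det
        d_beta = (i00 * s1 - i01 * s0) / det
        alpha += d_alpha
        beta += d_beta
        if max(abs(d_alpha), abs(d_beta)) < tol:
            break
    return alpha, beta

# Synthetic pixel data from hypothetical "true" parameters (illustration only).
random.seed(0)
N = 5000
z = [random.uniform(0.0, 10.0) for _ in range(N)]
y = [1 if random.random() < sigmoid(-1.0 - 0.5 * zj) else 0 for zj in z]

alpha_hat, beta_hat = fit_logistic(z, y)
p_hat = [sigmoid(alpha_hat + beta_hat * zj) for zj in z]

# Both sides of (2.5) and of (2.6) should agree at the fitted values.
print("alpha_hat, beta_hat:", alpha_hat, beta_hat)
print("(2.5):", sum(p_hat), "vs", sum(y))
print("(2.6):", sum(pj * zj for pj, zj in zip(p_hat, z)),
      "vs", sum(yj * zj for yj, zj in zip(y, z)))
```

Newton-Raphson is used here only because it is short to write for two parameters; it can fail in the pathological cases mentioned above, for example when no deposits are observed.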

Logistic regression is a simple two-parameter model, equivalent to linear regression on a transformed scale. The parameters are estimated using the entire dataset, as shown by Eq. (2.4) or (2.5)–(2.6). Consequently, the presence probability predicted by logistic regression, for a pixel with predictor value *z*, is influenced by data where the predictor value is very different from *z*, as discussed above.

It is not obligatory to use maximum likelihood estimation to fit the logistic regression model. Although maximum likelihood is theoretically optimal if the logistic regression model is true, it may fail if the model is false ("non-robust to mis-specification") and it is sensitive to anomalies in the data ("non-robust against outliers"). Robustness against outliers can be improved using *penalised likelihood* in which the likelihood *L* is multiplied by a term *b*(*α*, *β*) which penalises large parameter values.

### *2.3.4 Pixel Size and Model Consistency*

#### **Dependence on Pixel Size**

The results of a logistic regression analysis clearly depend on the size of the pixels used. Table 2.1 shows estimates of the parameters *α* and *β* in the logistic regression of gold deposits against distance from the nearest fault, in the Murchison data, obtained using different pixel grid sizes. Estimates of the slope parameter *β* are roughly consistent between different grids. The estimate of the intercept parameter *α* becomes lower (more negative) as the pixels become smaller, so that the predicted presence probabilities also become smaller: this is intuitively reasonable, since a smaller pixel must have a smaller chance of containing a deposit.

The score equations help to explain Table 2.1. If the pixel grid is subdivided into a finer grid, the right-hand sides of (2.5) and (2.6) are unchanged, so the left-hand sides must also be unchanged. Since the number of pixels *N* has been increased by the subdivision, the predicted probabilities *p<sub>j</sub>* must decrease by the same proportion *f*, the ratio of pixel areas in the two grids. Using log(*p*∕(1 − *p*)) ≈ log *p* for small *p*, the estimate of *α* must shift by approximately log *f*, which is negative because *f* < 1.

In order to make the results approximately consistent between different pixel sizes, the logistic regression (2.1) could be modified to

$$\log \frac{p}{1-p} = \log A + \alpha + \beta z \tag{2.7}$$

where *A* is the pixel area used. In the language of statistical modelling, the constant log *A* plays the role of an *offset* in the model formula. The resulting, adjusted estimates for the parameters *α*, *β* from the Murchison data are shown in Table 2.2, and they are indeed approximately consistent across different pixel sizes. They could have been obtained from the results in Table 2.1 by subtracting log *A* from the estimates of *α*.

**Table 2.1** Fitted logistic regression parameters for Murchison data

**Table 2.2** Fitted logistic regression parameters for Murchison data, adjusted for pixel area

For reasons explained below, slightly better consistency is achieved by replacing logistic regression (2.1) by *complementary log–log regression*

$$\log(-\log(1-p)) = \log A + \alpha + \beta z. \tag{2.8}$$
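The difference between the two link functions can be checked numerically. The sketch below anticipates the Poisson argument of Sect. 2.4 and rests on an assumption that is not part of the pixel model itself: within a pixel of area *A* the deposits follow a Poisson process of constant intensity exp(*α* + *βz*), so that the presence probability is *p* = 1 − exp(−*A* exp(*α* + *βz*)). Under that assumption, (2.8) holds exactly for every pixel area, while (2.7) holds only approximately, with an error that shrinks as *A* decreases. Parameter values and function names are hypothetical.

```python
import math

def logit(p):
    """Log odds, the link function used in (2.7)."""
    return math.log(p / (1.0 - p))

def cloglog(p):
    """Complementary log-log link used in (2.8): log(-log(1 - p))."""
    return math.log(-math.log1p(-p))

def link_errors(A, alpha, beta, z):
    """Return (error of (2.7), error of (2.8)) for a pixel of area A.

    Assumes a Poisson count with mean A * exp(alpha + beta * z) in the pixel,
    so that the presence probability is p = 1 - exp(-mean).
    """
    eta = alpha + beta * z
    p = -math.expm1(-A * math.exp(eta))       # 1 - exp(-A * lambda), stably
    target = math.log(A) + eta                # right hand side of (2.7), (2.8)
    return logit(p) - target, cloglog(p) - target

# Hypothetical parameter values (illustration only).
alpha, beta, z = -2.0, -0.3, 1.5
for A in (100.0, 10.0, 1.0, 0.01):
    e_logit, e_cloglog = link_errors(A, alpha, beta, z)
    print(f"A = {A:7.2f}   logit error = {e_logit:.6f}   "
          f"cloglog error = {e_cloglog:.2e}")
```

The logistic error decreases with the pixel area but never vanishes, whereas the complementary log-log error is zero up to rounding.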

#### **Large Pixels**

Large pixel sizes are preferred by some researchers. A common justification is that predictions are desired for large spatial regions, for example, the probability that the entire exploration lease contains at least one deposit. Some researchers also feel that small pixel sizes are inappropriate because they lead to tiny probability values, which may be considered physically unrealistic.

However, large pixels are not needed in order to predict the probability of a deposit in a large spatial region *R*. Suppose that a logistic regression model has been fitted using a fine grid of pixels. If the region *R* is decomposed into pixels, the probability *p*(*R*) of presence of at least one deposit in *R* satisfies

$$1 - p(R) = \prod_{j \in R} (1 - p_j), \tag{2.9}$$

where ∏<sub>*j*∈*R*</sub> denotes the product over all pixels in *R*. The left hand side is the probability that there are no deposits in *R*. On the right hand side, (1 − *p<sub>j</sub>*) is the probability that there are no deposits in pixel *j*, and since pixel outcomes are assumed to be independent, these pixel absence probabilities should be multiplied together. Hence, *p*(*R*) can be calculated using presence probabilities for a fine pixel grid.
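A short sketch of the product rule (2.9) follows. The pixel probabilities are hypothetical, and the function name is invented for the example. The product is accumulated on the logarithmic scale so that it stays accurate when there are many pixels with tiny probabilities.

```python
import math

def region_presence_probability(pixel_probs):
    """p(R) from (2.9), assuming independent pixel outcomes and every p_j < 1."""
    # log of the product of (1 - p_j), accumulated as a sum of log(1 - p_j)
    log_absence = sum(math.log1p(-p) for p in pixel_probs)
    # p(R) = 1 - exp(log_absence), computed without cancellation error
    return -math.expm1(log_absence)

# Hypothetical region made of 400 fine pixels, each with presence probability 0.001.
pixel_probs = [0.001] * 400
p_region = region_presence_probability(pixel_probs)
print("p(R) =", p_region)
print("naive sum of pixel probabilities =", sum(pixel_probs))
```

The result is smaller than the naive sum of the pixel probabilities, because the sum counts configurations with several occupied pixels more than once.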

Moreover, the use of large pixels in logistic regression causes difficulties, related to the aggregation of points into geographical areas (Elliott et al. 2000; Waller and Gotway 2004; Wakefield 2004, 2007). The most important of these is the statistical bias due to aggregation ('ecological bias', Wakefield (2004, 2007) or 'aggregation bias', Dean and Balshaw (1997), Alt et al. (2001)). The 'ecological fallacy' (Robinson 1950) is the incorrect belief that a model fitted to aggregated data will apply equally to the original un-aggregated data. The 'modifiable areal unit problem' (Openshaw 1984) or 'change-of-support' (Gotway and Young 2002; Banerjee and Gelfand 2002; Cressie 1996) is the problem of reconciling models that were fitted using different pixel sizes or aggregation levels.

Our analysis in Baddeley et al. (2010) shows that aggregation bias is highly dependent on the smoothness of the predictor as a function of spatial location. The distance-to-nearest-fault predictor in the Murchison example, and indeed the distance transform of any spatial feature, is a Lipschitz-continuous function of spatial location, which leads to relatively small aggregation bias. This is illustrated by Table 2.2. However, a predictor which indicates a classification, such as rock type, may have very substantial bias due to aggregation, persisting even at small pixel sizes (Baddeley et al. 2010).

Strictly speaking it can be *impossible* to reconcile two spatial logistic regression models fitted to the same spatial point pattern data using different pixel grids. Two such models are often logically incompatible (Baddeley et al. 2010), because the product rule (2.9) is incompatible with the logistic relation (2.1). It may help to recall that the pixels are artificial. A logistic regression model, using pixels of a particular size, makes an implicit assumption about the spatial random process of points in continuous space. For different pixel sizes, the corresponding assumptions are different, and generally incompatible. There is no random process in continuous space which satisfies a logistic regression model when it is discretised on *every* pixel grid. Two research teams who apply spatial logistic regression to the same data, but using different pixel sizes, may obtain results that cannot be reconciled exactly. This incompatibility can be eliminated by using complementary log–log regression (2.8) instead of logistic regression.

#### **Small Pixels**

Mathematical theory suggests that pixels should be as small as possible, in order to reduce the unwanted effects of aggregation (Baddeley et al. 2010). However, if this is taken literally, several practical problems arise. Small pixel size implies a large number of pixels. Software for logistic regression may suffer from numerical overflow. In a fine pixellation, the overwhelming majority of pixels do not contain a data point, so the overwhelming majority of response values *y<sub>j</sub>* are zero. This may cause numerical instability and algorithm failure. The standard algorithm for fitting logistic regression, Iteratively-Reweighted Least Squares (McCullagh and Nelder 1989), relies on second-order Taylor approximation of the log likelihood: the algorithm itself may fail when it encounters a numerically singular matrix, or the associated statistical tools may behave incorrectly due to the *Hauck-Donner effect* (Hauck and Donner 1977).

One valid strategy for avoiding these problems is to take only a random sample of the absence-pixels (the pixels with *y<sub>j</sub>* = 0), and to apply logistic regression to the subsampled data, using an additional offset to adjust for the sampling (Baddeley et al. 2015, Sect. 9.10).

A more natural and comprehensive solution is described in the next section.

### **2.4 Poisson Point Process Models**

Pixels are artificial, so it is reasonable to ask whether logistic regression for pixel data has a well-defined meaning in continuous space, without reference to the pixel grid and pixel size. The appropriate meaning is that of the Poisson point process, studied below.

### *2.4.1 Logistic Regression with Infinitesimal Pixels*

Logistic regressions fitted using different pixel sizes may be logically incompatible, except when the pixel size is very small. Accordingly, the only consistent interpretation of logistic regression is obtained by making the pixels *infinitesimal*.

Infinitesimal pixel size is a mathematical rather than a physical concept; it is comparable to the use of infinitesimal increments d*x* in differential and integral calculus. The practical user will not be required to "construct" infinitesimal pixels; they will exist only in the mathematical theory. Real physical measurements will be expressed as integrals over these infinitesimal pixels.

The presence probability *p* in an infinitesimal pixel will be infinitesimal. A more tangible quantity is the *intensity* or *rate* *λ*, loosely defined as the expected number of deposit points per unit area. In a pixel of very small area *A*, at most one deposit point will be present, so the expected number of points is equal to the probability of presence, and we have *λ* ≈ *p*∕*A*.

Logistic regression with infinitesimal pixels can be derived heuristically by letting the pixel size tend to zero. A rigorous argument is laid out in Baddeley et al. (2010), Warton and Shepherd (2010a, b). Assume that, for a small enough pixel size, logistic regression holds in the adjusted form (2.7), and that pixel outcomes are independent. Since *p* is small, log(*p*∕(1 − *p*)) ≈ log *p*, so that the logistic regression implies

$$
\log p = \log A + \alpha + \beta z
$$

or equivalently

$$
\log \lambda = \alpha + \beta z.
$$

This gives a consistent limit as pixel area tends to zero. In the limit, the intensity *λ*(*u*) at a spatial location *u* is a loglinear function of the predictor,

$$\lambda(u) = \exp(\alpha + \beta Z(u))\tag{2.10}$$

where *Z*(*u*) is the predictor value at location *u*.

Contrary to the claim that logistic regression is a flexible "nonparametric" model, we conclude that logistic regression is tantamount to assuming a loglinear (exponential) relationship between the density of deposits per unit area and the predictor variable *Z*.

### *2.4.2 Poisson Point Process*

Logistic regression, as commonly applied to presence/absence data, implicitly assumes that pixel outcomes are independent of each other. If independence holds for sufficiently small pixel size then, invoking the classical Poisson limit theorem, the random number of deposits falling in any spatial region *R* must follow a Poisson distribution.

**Definition 1** A random variable *K* taking nonnegative integer values has a Poisson distribution with mean *μ* if

$$\mathbb{P}\{K=k\} = e^{-\mu} \frac{\mu^k}{k!} \tag{2.11}$$

for any *k* = 0*,* 1*,* 2*,*….

Consequently (Warton and Shepherd 2010a, b; Baddeley et al. 2010; Renner et al. 2015)

**Theorem 1** *If logistic regression holds in the adjusted form* (2.7) *for sufficiently small pixels, then the random spatial pattern of deposit points must follow a* Poisson point process *with intensity of the form* (2.10)*.*

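**Definition 2** The spatial Poisson point process **X** with intensity function *λ*(*u*), *u* ∈ ℝ<sup>2</sup>, is characterised by the following properties:

(**PP1**) *Poisson counts*: the number *n*(**X** ∩ *B*) of points falling in any region *B* has a Poisson distribution;

(**PP2**) *intensity*: the number of points falling in *B* has mean

$$\mu(B) = \mathbb{E}[n(\mathbf{X} \cap B)] = \int_{B} \lambda(u) \,\mathrm{d}u;\tag{2.12}$$

(**PP3**) *independence*: the numbers of points falling in disjoint regions are independent random variables;

(**PP4**) *conditional property*: given that there are exactly *n* points in *B*, these points are independent and identically distributed, with probability density

$$f(u) = \frac{\lambda(u)}{I} \tag{2.13}$$

where *I* = ∫<sub>*B*</sub> *λ*(*u*) d*u*.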

The intensity function *λ*(*u*) completely determines the Poisson point process model. It encapsulates both the abundance of points (by Eq. (2.12)) and the spatial distribution of individual point locations (by Eq. (2.13)). Values of intensity have dimension length<sup>−2</sup>.

The properties listed above can be used directly to simulate random realisations of the Poisson process. See Daley and Vere-Jones (2003, 2008) for an authoritative treatise on point processes, or (Baddeley et al. 2015, Chaps. 5, 9) for an introduction, and Kutoyants (1998), Møller and Waagepetersen (2004) for further details of statistical theory for point processes.
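As an illustration of such a simulation, the sketch below generates one realisation of a Poisson process with loglinear intensity (2.10) in a rectangle. It uses thinning (the Lewis-Shedler method), which is swapped in here for brevity: a homogeneous process of rate `lam_max` is generated first, and each point *u* is kept with probability *λ*(*u*)∕`lam_max`. The rectangle, the parameter values and the predictor (distance to a hypothetical straight fault along the line *x* = 0) are all assumptions made for the example.

```python
import math
import random

def poisson_count(mean, rng):
    """Poisson(mean) variate: arrivals of a unit-rate process in [0, mean]."""
    k = 0
    t = rng.expovariate(1.0)
    while t <= mean:
        k += 1
        t += rng.expovariate(1.0)
    return k

def simulate_poisson_process(intensity, lam_max, width, height, rng):
    """Simulate a Poisson process on [0, width] x [0, height] by thinning.

    `intensity(u)` must not exceed `lam_max` anywhere in the rectangle.
    """
    n = poisson_count(lam_max * width * height, rng)   # homogeneous count
    points = []
    for _ in range(n):
        u = (rng.uniform(0.0, width), rng.uniform(0.0, height))
        if rng.random() < intensity(u) / lam_max:      # keep with prob. lambda/lam_max
            points.append(u)
    return points

# Hypothetical setting: Z(u) is the distance to a fault along the line x = 0.
alpha, beta = 1.0, -0.5

def intensity(u):
    return math.exp(alpha + beta * u[0])               # Eq. (2.10) with Z(u) = x

lam_max = math.exp(alpha)      # maximum of the intensity, attained at x = 0
rng = random.Random(1)
points = simulate_poisson_process(intensity, lam_max, 10.0, 10.0, rng)
print(len(points), "simulated points; mean x =",
      sum(x for x, _ in points) / len(points))
```

Because the slope is negative, the simulated points concentrate near the hypothetical fault, as the model prescribes.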

Theorem 1 establishes a logically consistent, physical meaning in continuous space for the logistic regression model fitted to pixel presence/absence data. Whereas logistic regression models can be somewhat difficult to interpret in practical terms, the infinitesimal-pixel limit of logistic regression is a very simple model, a Poisson point process whose intensity *λ*(*u*) depends exponentially (log-linearly) on the predictor *Z*(*u*) through (2.10). This model is well-studied, and permits highly detailed predictions to be made about various quantities, such as the expected number of points in a target region (using **PP2**), the probability of exactly *k* points in a target region (using **PP1**), and the probability distribution of distance from a fixed starting location to the nearest random point.
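The first two of these predictions can be sketched numerically. The fragment below approximates the integral in (2.12) over a rectangular target region by the midpoint rule and then evaluates the Poisson probabilities (2.11). The intensity, its parameter values and the target region are hypothetical, and the function names are invented for the example.

```python
import math

def expected_count(intensity, x0, x1, y0, y1, nx=200, ny=200):
    """Midpoint-rule approximation of (2.12) over the rectangle [x0,x1] x [y0,y1]."""
    dx = (x1 - x0) / nx
    dy = (y1 - y0) / ny
    total = 0.0
    for i in range(nx):
        x = x0 + (i + 0.5) * dx
        for j in range(ny):
            y = y0 + (j + 0.5) * dy
            total += intensity((x, y))
    return total * dx * dy

def poisson_pmf(k, mu):
    """Probability of exactly k points, Eq. (2.11), for mu > 0."""
    return math.exp(-mu + k * math.log(mu) - math.lgamma(k + 1))

# Hypothetical loglinear intensity; Z(u) is the distance to a fault along x = 0.
alpha, beta = -3.0, -0.5

def intensity(u):
    return math.exp(alpha + beta * u[0])

# Target region B = [0, 4] x [0, 5] (illustration only).
mu = expected_count(intensity, 0.0, 4.0, 0.0, 5.0)
print("expected number of deposits in B:", mu)
print("P(no deposits in B):", poisson_pmf(0, mu))
print("P(exactly one deposit in B):", poisson_pmf(1, mu))
print("P(at least one deposit in B):", -math.expm1(-mu))
```

For this particular intensity the integral has a closed form, which can be used to check the numerical approximation; for a predictor stored as a pixel image, the same sum would run over the image pixels inside the target region.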

The conclusion of Theorem 1 remains true in the more general case where the pixel outcomes are weakly dependent on each other (Baddeley et al. 2010, Theorem 3).

From a statistical perspective, the Poisson point process is the fundamental model, while logistic regression is a practical technique for fitting this model approximately on a discretised grid. The connection between them is not a surprise: indeed it is strongly suggested by the standard 'infinitesimal' description of the Poisson point process (Breiman 1968). It is inconceivable that Tukey (1972) was unaware of this connection.

### *2.4.3 Fitting a Poisson Point Process Model*

#### **Fitting Procedures**

We emphasise the distinction between a statistical model and the procedure used to fit the model. The statistical model is a description of both the systematic tendencies and the random variability in the observations, and allows us to make predictions. The model must first be fitted to the observed data. The fitting procedure is not uniquely determined by the model (unless we choose to follow a rule such as maximum likelihood) and there may be several possible choices of procedure, each with its own merits.

The Poisson point process, with loglinear intensity (2.10), has been identified as the relevant model for spatial point pattern data in continuous space. We shall now mention several possible fitting procedures for this model.

First we consider maximum likelihood. Suppose that the observed deposit locations are *x*<sub>1</sub>, …, *x<sub>n</sub>* in study region *W*. Then the log likelihood of the Poisson point process with intensity function *λ*(*u*) is


$$\log L = \sum\_{i=1}^{n} \log \lambda(\mathbf{x}\_i) + \int\_W (1 - \lambda(u)) \, \mathrm{d}u. \tag{2.14}$$

This can be derived either from the characteristic properties (**PP1**)–(**PP4**) of the Poisson process, or by taking the limit of the logistic regression likelihood (2.4), with appropriate rescaling, as pixel size tends to zero. See Baddeley et al. (2010), Warton and Shepherd (2010a, b), Baddeley et al. (2015, Sect. 9.7).

For the loglinear intensity model (2.10), the score equations are obtained by setting the partial derivatives of (2.14) to zero, giving

$$\int\_{W} \lambda(u) \, \mathrm{d}u = n \tag{2.15}$$

$$\int\_{W} Z(u)\,\lambda(u)\,\mathrm{d}u = \sum\_{i=1}^{n} Z(\mathbf{x}\_{i})\tag{2.16}$$

and these are also the infinitesimal-pixel limits of the logistic regression score equations (2.5)–(2.6). The score equations have the same "method-of-moments" interpretation as in the discrete case: namely the left hand side of each equation is the theoretical mean value, under the model, of the statistic that is evaluated for the observed data on the right hand side.

The main practical challenge in fitting the model is the fact that Eqs. (2.14) or (2.15)–(2.16) involve an integral over the study region. Unless this integral can be simplified using calculus, it must be approximated numerically.

An important case where the integral *can* be simplified is where *Z*(*u*) takes only the values 0 and 1. This predictor might represent a particular rock type such as the greenstone in the Murchison example. If this is the only predictor, then the integrals in (2.14)–(2.16) can be evaluated exactly, given only the area of the greenstone and non-greenstone regions, because the integrands are constant in each region. Then the model can be fitted exactly. This case is a rare exception.
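For this binary case the score equations (2.15)–(2.16) can be solved by hand: the fitted intensity in each region is simply the number of deposits divided by the area. The sketch below illustrates this under the assumption that both regions contain at least one deposit. The function name and the counts and areas are hypothetical, chosen only for illustration, and are not the Murchison values.

```python
import math

def fit_binary_loglinear(n0, area0, n1, area1):
    """Exact maximum likelihood fit of lambda(u) = exp(alpha + beta * Z(u))
    when Z(u) is 0 on a region of area area0 (containing n0 deposits)
    and 1 on a region of area area1 (containing n1 deposits)."""
    alpha = math.log(n0 / area0)          # log intensity where Z = 0
    beta = math.log(n1 / area1) - alpha   # log ratio of the two intensities
    return alpha, beta

# Hypothetical survey: 10 deposits in 800 km^2 where Z = 0, 30 deposits in 200 km^2 where Z = 1.
alpha, beta = fit_binary_loglinear(10, 800.0, 30, 200.0)
print("alpha =", round(alpha, 3), " beta =", round(beta, 3))

# Check the score equations (2.15) and (2.16): model totals match observed totals.
print(800.0 * math.exp(alpha) + 200.0 * math.exp(alpha + beta))  # compare with n = 40
print(200.0 * math.exp(alpha + beta))                            # compare with sum of Z(x_i) = 30
```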

#### **Pixel Regression**

The simplest approximation of an integral is the midpoint rule, using the sum of values of the integrand at a regular grid of sample points. This leads to the logistic regression technique of Sect. 2.3. The observed spatial locations *x*<sub>1</sub>, …, *x<sub>n</sub>* of the deposits are discretised into pixel presence-absence indicators *y*<sub>1</sub>, …, *y<sub>N</sub>*. The predictor *Z* is evaluated at the pixel centres *c<sub>j</sub>* to give predictor values *z<sub>j</sub>* = *Z*(*c<sub>j</sub>*), and logistic regression of *y* against *z* is performed.

Procedures of this type are well-established in statistical science. Lewis (1972) and Tukey's former student Brillinger (Brillinger 1978; Brillinger and Segundo 1979; Brillinger and Preisler 1986) showed that the likelihood of a general point process in one-dimensional time, or a Poisson point process in higher dimensions, can be usefully approximated by the likelihood of logistic regression for the discretised process. Asymptotic equivalence was established in Besag et al. (1982). This makes it practicable to fit spatial Poisson point process models of general form to point pattern data (Berman and Turner 1992; Clyde and Strauss 1991; Baddeley and Turner 2000, 2005) by enlisting efficient and reliable software already developed for generalized linear models. Approximation of a stochastic process by a generalized linear model is now commonplace in applied statistics (Lindsey 1992, 1995, 1997; Lindsey and Mersch 1992).

Complementary log–log regression is more appropriate than logistic regression in this context. A Poisson random variable *K* with mean *μ* has probability ℙ{*K* = 0} = *e*<sup>−*μ*</sup> of taking the value zero, by (2.11), and has probability *p* = 1 − *e*<sup>−*μ*</sup> of taking a positive value. In a Poisson point process with intensity function *λ*(*u*), the presence probability of at least one deposit in a given region *B* is therefore

$$p(B) = 1 - e^{-\mu(B)} = 1 - \exp\left(-\int\_B \lambda(u) \, \mathrm{d}u\right).$$

Inverting this relationship, the expected number of points in *B* is

$$\int\_{B} \lambda(u) \, \mathrm{d}u = -\log(1 - p(B))$$

If *B* is a small pixel of area *A*, and the intensity has the loglinear form (2.10), then the relationship between presence probability and the predictor variable is

$$
\log(-\log(1-p)) = \log A + \alpha + \beta z,
$$

which follows the complementary log–log regression relationship (2.8) rather than the logistic regression (2.1). However, the discrepancy is small in many cases, and the logistic function log(*p*∕(1 − *p*)) has slightly better numerical and computational properties, because it is the theoretically "canonical" link function (McCullagh and Nelder 1989).
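The size of this discrepancy can be checked numerically. The following sketch compares the two link functions for a pixel whose expected number of deposits is *μ* = *A* exp(*α* + *βz*); the values of *μ* are arbitrary illustrative choices. The complementary log–log transform returns log *μ* exactly, while the logit differs from log *μ* by log((*e*<sup>*μ*</sup> − 1)∕*μ*), which is approximately *μ*∕2 for small *μ*.

```python
import math

def presence_probability(mu):
    """Probability of at least one point when the expected count is mu."""
    return -math.expm1(-mu)  # 1 - exp(-mu), computed accurately for small mu

def cloglog(p):
    """Complementary log-log transform log(-log(1 - p))."""
    return math.log(-math.log1p(-p))

def logit(p):
    """Logistic transform log(p / (1 - p))."""
    return math.log(p) - math.log1p(-p)

# Discrepancy of each link from log(mu) as the pixel (and hence mu) shrinks.
for mu in (1.0, 0.1, 0.01, 0.001):
    p = presence_probability(mu)
    print(mu, cloglog(p) - math.log(mu), logit(p) - math.log(mu))
```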

#### **Berman-Turner Device**

In numerical analysis, an integral can often be approximated more accurately using a *quadrature rule*, based on a small number of well-chosen sample points, rather than a dense grid of sample points. Berman and Turner (1992) applied this principle to the Poisson point process likelihood (2.14) and developed an efficient fitting procedure based on a relatively small number of sample points.

In the Berman-Turner scheme, the sample points *u*<sub>1</sub>, …, *u<sub>m</sub>* consist of the observed deposit locations *x*<sub>1</sub>, …, *x<sub>n</sub>* together with a complementary set of "dummy" points *u*<sub>*n*+1</sub>, …, *u<sub>m</sub>*. The integral of any function *f* is approximated using the quadrature rule

$$\int\_{W} f(u) \,\mathrm{d}u \approx \sum\_{k=1}^{m} w\_{k} f(u\_{k}),$$

where *w*<sub>1</sub>, …, *w<sub>m</sub>* are numerical weights chosen appropriately. For example, the weights *w<sub>k</sub>* may be the areas of the tiles of the Dirichlet-Voronoï tessellation (Okabe et al. 1992) of *W* associated with the quadrature points *u*<sub>1</sub>, …, *u<sub>m</sub>*. The Poisson process log likelihood (2.14) is then approximated by

$$\begin{split} \log \text{BTL} &= \sum\_{i=1}^{n} \log \lambda(\mathbf{x}\_i) + \sum\_{k=1}^{m} w\_k (1 - \lambda(u\_k)) \\ &= \sum\_{k=1}^{m} (I\_k \log \lambda(u\_k) + w\_k (1 - \lambda(u\_k))) \end{split} \tag{2.17}$$

where *I<sub>k</sub>* = 1 if the quadrature point *u<sub>k</sub>* is a data point, and *I<sub>k</sub>* = 0 if it is a dummy point. The approximate log likelihood (2.17) has the same form as the (weighted) log likelihood of a Poisson regression model, and can be fitted reliably using existing statistical software (Berman and Turner 1992). The Berman-Turner technique is the main algorithm for point process modelling in the software package spatstat (Baddeley et al. 2015, Chap. 9).
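To make the device concrete, here is a minimal sketch that maximises (2.17) by Newton's method for a loglinear intensity. It is not the spatstat implementation. It assumes a deliberately simplified geometry: a strip of unit width in which the covariate is the coordinate along the strip. It also uses simple "counting" weights (cell area divided by the number of quadrature points in the cell) in place of Dirichlet tile areas. The function name and the data values are hypothetical.

```python
import math

def berman_turner_fit(data_z, dummy_z, length, n_cells, iterations=50):
    """Fit log lambda = alpha + beta * z by maximising the Berman-Turner
    approximation (2.17) on the strip [0, length] x [0, 1] (assumed geometry)."""
    z = list(data_z) + list(dummy_z)  # quadrature points: data then dummy
    # Counting weights: cell area divided by the number of quadrature points in that cell.
    cell = [min(int(zk / length * n_cells), n_cells - 1) for zk in z]
    per_cell = [cell.count(c) for c in range(n_cells)]
    w = [(length / n_cells) / per_cell[c] for c in cell]
    n = len(data_z)
    alpha, beta = math.log(n / length), 0.0  # start from the constant-intensity fit
    for _ in range(iterations):
        lam = [math.exp(alpha + beta * zk) for zk in z]
        # Weighted sums approximating the integrals of lambda, z*lambda and z^2*lambda.
        m00 = sum(wk * lk for wk, lk in zip(w, lam))
        m01 = sum(wk * zk * lk for wk, zk, lk in zip(w, z, lam))
        m11 = sum(wk * zk * zk * lk for wk, zk, lk in zip(w, z, lam))
        # Score: observed statistics minus their approximate expected values.
        g0 = n - m00
        g1 = sum(data_z) - m01
        # Newton step using the 2 x 2 information matrix [[m00, m01], [m01, m11]].
        det = m00 * m11 - m01 * m01
        step_a = (m11 * g0 - m01 * g1) / det
        step_b = (m00 * g1 - m01 * g0) / det
        alpha, beta = alpha + step_a, beta + step_b
        if abs(step_a) + abs(step_b) < 1e-12:
            break
    return alpha, beta, z, w

# Hypothetical data: covariate values (km) at 10 deposits along a 10 km strip.
data_z = [0.2, 0.5, 0.9, 1.1, 1.6, 2.3, 3.0, 4.1, 5.5, 7.8]
dummy_z = [0.25 + 0.5 * k for k in range(20)]  # one dummy point at each cell centre
alpha_hat, beta_hat, quad_z, quad_w = berman_turner_fit(data_z, dummy_z, 10.0, 20)
print("alpha =", round(alpha_hat, 3), " beta =", round(beta_hat, 3))
```

At convergence the quadrature versions of the score equations (2.15)–(2.16) hold, so the weighted sum of the fitted intensity over the quadrature points equals the number of data points.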

If the predictor variables are smooth functions of spatial location, then the Berman-Turner device is extremely efficient, because of the properties of numerical quadrature (Berman and Turner 1992; Baddeley and Turner 2000). This applies, for example, to the distance function of the geological faults in the Murchison example. The approximation is less accurate when the predictor is discontinuous, such as an indicator of rock type.

#### **Conditional Logistic Regression**

An alternative fitting method involves placing the "dummy" sample points at random. This is the equivalent of the procedure, already described for pixel presence/absence data, of randomly selecting a subset of the pixels where no deposit is present.

Suppose the dummy point pattern **d** is randomly generated according to a Poisson point process with known intensity *δ* > 0. Combine the two point patterns, data **x** and dummy **d**, into a single pattern **v** = **x** ∪ **d**; this is a realisation of a random point process with intensity *λ*(*u*) + *δ*. Given **v** = {*v*<sub>1</sub>, …, *v<sub>J</sub>*}, that is, given only the locations of the combined pattern of data and dummy points, let *s*<sub>1</sub>, …, *s<sub>J</sub>* be indicators such that *s<sub>j</sub>* = 1 if the point *v<sub>j</sub>* is a data point, and *s<sub>j</sub>* = 0 if it is a dummy point. The probability *q<sub>j</sub>* = ℙ{*s<sub>j</sub>* = 1} that a given random point *v<sub>j</sub>* is actually a data point, equals

$$q\_j = \frac{\lambda(v\_j)}{\lambda(v\_j) + \delta},$$

the ratio of the intensity of **x** to the intensity of **v**. Hence

$$\log \frac{q\_j}{1 - q\_j} = \log \lambda(v\_j) - \log \delta = \alpha + \beta Z(v\_j) - \log \delta. \tag{2.18}$$

The data/dummy status *s<sub>j</sub>* of each point *v<sub>j</sub>* is independent of other points. It follows that the conditional likelihood of the data/dummy status of the points of **v**, given their locations, is the likelihood of logistic regression in the form (2.18). The Poisson point process model with loglinear intensity (2.10) could be fitted by logistic regression of *s<sub>j</sub>* on *z<sub>j</sub>* = *Z*(*v<sub>j</sub>*) given **v**.
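The role of the known dummy intensity in (2.18) can be sketched as follows. The term −log *δ* acts as a fixed offset. If the logistic regression is instead fitted without an offset, the fitted intercept estimates *α* − log *δ*, and *α* is recovered by adding log *δ*. The function names and the value of *δ* are hypothetical.

```python
import math

def data_probability(lam, delta):
    """Probability that a point of the combined pattern is a data point,
    given data intensity lam at that location and dummy intensity delta."""
    return lam / (lam + delta)

def intensity_parameters(intercept, slope, delta):
    """Convert coefficients of a logistic regression fitted WITHOUT an offset
    into the intensity parameters (alpha, beta) of Eq. (2.18)."""
    return intercept + math.log(delta), slope

# Illustration with an assumed dummy intensity of 0.05 points per unit area.
delta = 0.05
lam = math.exp(-4.0 - 0.3 * 2.0)  # hypothetical intensity at a location with Z = 2
q = data_probability(lam, delta)
print(math.log(q / (1.0 - q)), math.log(lam) - math.log(delta))  # the two sides of (2.18)
```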

This technique relies on the independence properties of the Poisson point process, and is a counterpart of the well-known relationship between logistic regression and loglinear Poisson models in contingency tables (Dobson and Barnett 2008; McCullagh and Nelder 1989).

Several versions of this technique have been used for point pattern data in continuous space (Diggle and Rowlingson 1994; Baddeley et al. 2014). By using random sample points, the technique avoids bias which may occur in numerical quadrature, while potentially increasing variability due to random sampling. The variance contribution due to randomisation can be estimated, and appears to be acceptable in many cases (Baddeley et al. 2014).

#### **Maximum Entropy**

The principle of maximum entropy (Dutta 1966) is often used in ecology, for example, to study the influence of habitat variables on the spatial distribution of animals or plants (Dudík et al. 2007; Elith et al. 2011; Phillips et al. 2006). Conceptually this method considers all possible spatial distributions, and finds the spatial distribution which maximises a quantity called entropy, subject to constraints implied by the observed data. The constraints are equivalent to the score equations (2.15)–(2.16) or (2.5)–(2.6). The maximum entropy solution is a probability distribution which is a *loglinear* function of the predictors. It was shown in Renner and Warton (2013) that this solution is equivalent to fitting a loglinear Poisson point process, or equivalent to logistic regression on a fine pixel grid. An analogy could be drawn with the stretching of a string: a string may take on any shape, but if we demand that the string be stretched as tight as possible, it will take up a straight line. Thus, this analysis principle is equivalent to fitting a Poisson point process model with loglinear intensity.

### *2.4.4 Murchison Example*

Here we give a worked example of Poisson point process modelling for the Murchison data of Fig. 2.1. The gold deposit locations are assumed to follow a Poisson process with intensity *λ*(*u*) assumed to be a loglinear function of distance to the nearest fault,

$$\lambda(u) = \exp(\alpha + \beta d(u)),\tag{2.19}$$

where *α*, *β* are parameters and *d*(*u*) is the distance from location *u* to the nearest geological fault. Contours of *d*(*u*) are shown in Fig. 2.2. The model (2.19) corresponds to logistic regression of pixel presence/absence indicators against distance to nearest fault.

**Fig. 2.5** Fitted intensity of gold deposits in the Murchison survey according to the loglinear Poisson point process model. Pixel values are intensities (number of deposits per square kilometre).

We used the Berman-Turner device as implemented in the spatstat package (Baddeley et al. 2015) in the function ppm. The fitted parameters are *α̂* = −4.34 and *β̂* = −0.26 km<sup>−1</sup>. These values are quite similar to the estimates in Table 2.2, as expected. The fitted intensity function,

$$\lambda(u) = \exp(-4.34 - 0.26 \, d(u)),\tag{2.20}$$

is displayed as a greyscale image in Fig. 2.5. Note that the spatial resolution of Fig. 2.5 is finer than the spacing of sample points used to fit the model; indeed *λ*(*u*) can be evaluated at any location *u* in continuous space, using (2.20).

The fitted intensity relationship (2.20) can be interpreted directly. The estimated intensity of gold deposits in the immediate vicinity of a geological fault is about exp(−4.34) = 0.013 deposits per square kilometre, or 1.3 deposits per 100 km<sup>2</sup>. This intensity decreases by a *factor* of exp(−0.26) = 0.77 for every additional kilometre away from a fault. At a distance of 10 km, the intensity has fallen by a factor of exp(10 × (−0.26)) = 0.074 to exp(−4.34 + 10 × (−0.26)) = 0.001 deposits per square kilometre, or 0.1 deposits per 100 km<sup>2</sup>. Figure 2.6 shows the effect of the distance covariate on the intensity function, according to the fitted loglinear Poisson model.

**Fig. 2.6** Fitted intensity of Murchison gold deposits as a function of distance to the nearest fault, assuming it is a loglinear function of distance. Solid line: maximum likelihood estimate. Shading: pointwise 95% confidence interval

**Fig. 2.7** Perspective view of fitted intensity surface of loglinear Poisson point process model of Murchison gold deposits against distance from nearest fault
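The arithmetic in this interpretation can be reproduced in a few lines. The sketch below evaluates the fitted relationship (2.20) at several distances, using the rounded parameter estimates quoted in the text; the function name is hypothetical and the chosen distances are arbitrary.

```python
import math

ALPHA_HAT = -4.34  # fitted intercept quoted in the text
BETA_HAT = -0.26   # fitted slope per kilometre quoted in the text

def fitted_intensity(distance_km):
    """Fitted intensity (deposits per square kilometre) at a given distance from the nearest fault."""
    return math.exp(ALPHA_HAT + BETA_HAT * distance_km)

for d in (0.0, 1.0, 5.0, 10.0):
    per_100_km2 = 100.0 * fitted_intensity(d)
    ratio_to_fault = fitted_intensity(d) / fitted_intensity(0.0)
    print(d, round(per_100_km2, 2), round(ratio_to_fault, 3))
```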

Figure 2.7 shows a perspective view of the fitted intensity function, treated as a surface in three dimensions. Note that, fortuitously, the southern edge of the perspective plot in Fig. 2.7 shows the shape of the fitted intensity curve in Fig. 2.6.

We caution again that this analysis has not fitted a highly flexible model in which the abundance of gold deposits depends, in some unspecified way, on the distance to the nearest fault. Rather, the very specific loglinear relationship (2.19) has been fitted. The flexible part of this analysis is the freedom to choose another predictor variable or variables to replace the distance function *d*(*u*). Once the predictors are chosen, the analysis becomes rigidly parametric.

### *2.4.5 Statistical Inference*

The Poisson point process model with loglinear intensity (2.10) belongs to the class of "exponential family" models (McCullagh and Nelder 1989). Statistical inference has been studied in detail for this class (Barndorff-Nielsen 1978) and for the Poisson process in particular (Kutoyants 1998; Rathbun and Cressie 1994).

A full set of standard tools is available for statistical inference. These include standard errors and confidence intervals for the parameter estimates, hypothesis tests (likelihood ratio test, score test), and variable selection and model selection (analysis of deviance, Akaike information criterion). See Baddeley et al. (2015, Chap. 9) for a full implementation.

Table 2.3 shows the estimated standard errors and 95% confidence intervals for the parameters in the loglinear model fitted to the Murchison data. These are asymptotic standard errors based on the Fisher information matrix.

Analysis of variance, or in this case, analysis of deviance (McCullagh and Nelder 1989; Hosmer and Lemeshow 2000; Dobson and Barnett 2008) supports a formal hypothesis test of statistical significance for the dependence on a predictor variable. For example the likelihood ratio test of the null hypothesis *β* = 0 against the alternative *β* ≠ 0 indicates very strong evidence that gold deposit abundance is dependent on the distance to the nearest fault.

Recently-developed tools for model selection in point process models include Sufficient Dimension Reduction (Guan and Wang 2010).


**Table 2.3** Standard errors and confidence intervals for parameters in loglinear Poisson model of Murchison data

### *2.4.6 Diagnostics*

A fitted model is not like a fitted shoe. A shoe must approximately match the shape of the wearer's foot before we call it fitted. By contrast, statistical software "fits" a model to data on the assumption that the model is true, and does not check that the model describes the data at all.

Diagnostic quantities and diagnostic plots for a fitted model should be used to check the model assumptions. For linear regression and linear models, diagnostics are highly developed in statistical theory and applied statistical practice (Atkinson 1985). For logistic regression in a general context, diagnostics are also available (Landwehr et al. 1984; Dobson and Barnett 2008) and these extend to the "exponential family" class of models, at least in theory.

Diagnostics for the Poisson point process model, corresponding to the well-known diagnostics for logistic regression, were developed in Baddeley et al. (2013a, b). Two of these are shown here for the Murchison data.

The *influence* measure *e<sub>i</sub>* is the effect on the fitted log likelihood of deleting the *i*th deposit point *x<sub>i</sub>* (Baddeley et al. 2013a). Figure 2.8 shows circles of diameter proportional to *e<sub>i</sub>* centred at the deposit locations *x<sub>i</sub>*. The geological fault pattern is also shown. In this Figure, large circles represent observations which had a large effect on the resulting fitted model. There is one very large circle at middle left of the Figure, and we notice that there are no geological faults near this deposit. That is, the influence diagnostic identifies this deposit as anomalous, perhaps an "outlier", with respect to the fitted model in which deposits are most likely to occur close to a geological fault. This is an entirely data-driven diagnostic, and tells us only that this observation is anomalous with respect to the model. It is unable to tell us whether the deposit is truly anomalous in geological terms, or whether the survey perhaps failed to detect an existing geological fault near this location.

**Fig. 2.8** Influence diagnostic for the loglinear Poisson model of gold deposits against distance to nearest fault. Circle diameters are proportional to the influence of each deposit. Grey lines are geological faults

Strategies for dealing with anomalous data include outlier detection and removal, and robust model-fitting which is resistant to the effects of outliers. Robust parameter estimation for Poisson point process models was developed in Assunção and Guttorp (1999).

Figure 2.9 shows a *partial residual* plot (Baddeley et al. 2013b) for the Murchison gold deposits against distance to nearest fault. Assuming that the loglinear model (2.19) is approximately true, say log *λ*(*u*) = *α* + *β* *d*(*u*) + *H*(*d*(*u*)) where the error *H*(*d*) is small, this procedure forms an estimate *Ĥ*(*d*) of the error term, adds it to the fitted linear term, and plots *α̂* + *β̂* *d* + *Ĥ*(*d*) against values of distance *d*. If the model is correct, this plot should be a straight line. Departures from the straight line can be interpreted as suggesting the correct form of dependence. Figure 2.9 suggests there is a minor departure from the loglinear model.

An alternative way to explore non-linearity is to fit a polynomial or spline function in place of the linear function on the right hand side of (2.1) or (2.19). In order to avoid over-fitting and numerical instability, the model should be fitted by *penalised* maximum likelihood, in which the log likelihood (2.14) is augmented by a penalty term that discourages extreme values of the parameters which might produce a wildly-oscillating polynomial. Figure 2.10 shows a penalised maximum likelihood fit of a model in which the log intensity is a fifth-order B-spline function of distance to the nearest fault. The model was fitted in the spatstat package using code for Generalised Additive Models (GAM) (Hastie and Tibshirani 1990). This fit also suggests minor departure from the loglinear model.

### *2.4.7 Rationale for Prediction*

Up to this point, our commentary on prospectivity analysis applies equally well to the analysis of archaeological finds, plant species distribution, etc., using logistic regression and related tools. However, the key goal of prospectivity analysis is the *prediction of previously-unknown deposits*, and this sets it apart from other applications. This prediction problem deserves more attention from the statistical community, and we shall identify several topics for research.

The rationale for predicting "new" mineral deposits is clearest when we extrapolate from a fully-explored region to an unexplored region. We might extrapolate from a previous, fully-explored mining lease to a newly-granted exploration lease which is geologically analogous. We fit a model to the fully-explored region, obtaining estimates of the model parameters *α*, *β*, which we believe can be extrapolated to the unexplored region. Applying the fitted model relationship to the predictor variables for the new region, we obtain explicit predictions for the mineral deposits in the new region. These predictions may include expected numbers of deposits, probability of no deposits, probability distribution of distance to the nearest deposit, and so on. These predictions are valid calculations even if the geological structure in the two regions was formed at the same epoch, because of the assumption of independence between deposits. Essentially the fully-explored region is used to obtain estimates of the parameters of the "laws" which apply to both regions, and these laws are then applied to the new region.

The statistical reasoning is far more complicated when we wish to predict hitherto-undiscovered mineral deposits from known deposits *in the same region*. It would be futile to assume that the region has been fully explored, since this would imply there are no deposits remaining to be discovered. Instead our statistical model must now recognise two categories of deposits, known and unknown. The methods described above can be re-deployed if we assume that the true spatial pattern of *all* deposits (whether known or unknown) is a Poisson point process with intensity function *τ*(*u*), say, and that a deposit existing at a location *u* will be detected with probability *P*(*u*), independently of other deposits. Then, by the "thinning" property of the Poisson process, the pattern of *detected* deposits is also a Poisson process, with intensity *λ*(*u*) = *P*(*u*) *τ*(*u*); the pattern of *undetected* deposits is a Poisson process with intensity *ξ*(*u*) = (1 − *P*(*u*)) *τ*(*u*); and the detected and undetected deposits are independent of each other. Fitting a Poisson point process model to the observed mineral deposits allows us to estimate *λ*(*u*) only. If the detection probability *P*(*u*) is known, then it becomes feasible to back-calculate *τ*(*u*) = *λ*(*u*)∕*P*(*u*) and

$$
\xi(u) = \frac{1 - P(u)}{P(u)} \lambda(u). \tag{2.21}
$$

It is then possible to make predictions or conditional simulations of the undetected deposits. The independence property of the Poisson process implies that the prediction or conditional simulation depends only on the fitted model parameters, and does not otherwise depend on the observed deposits. The conditional simulation is a realisation of the Poisson process of the assumed loglinear form with the parameter values fitted from the data: the simulated realisation is independent of the observed deposits, given the fitted model parameters.
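The back-calculation in (2.21) and the resulting predictions can be sketched as follows, under the strong assumption that the detection probability is known and constant over the region of interest. The fitted intensity is taken to be given on a grid of equal-area pixels. The function names and numerical values are hypothetical.

```python
import math

def undetected_intensity(detected_intensity, detection_probability):
    """Equation (2.21): intensity of undetected deposits implied by the fitted
    intensity of detected deposits and a known detection probability."""
    if not 0.0 < detection_probability <= 1.0:
        raise ValueError("detection probability must lie in (0, 1]")
    return (1.0 - detection_probability) / detection_probability * detected_intensity

def undetected_summary(pixel_intensities, pixel_area, detection_probability):
    """Expected number of undetected deposits in a region made of equal-area pixels,
    and the Poisson probability that the region contains no undetected deposit."""
    expected = sum(undetected_intensity(lam, detection_probability) * pixel_area
                   for lam in pixel_intensities)
    return expected, math.exp(-expected)

# Hypothetical region of two 4 km^2 pixels with fitted intensities 0.01 and 0.02 per km^2,
# and an assumed detection probability of 0.8.
expected, prob_none = undetected_summary([0.01, 0.02], 4.0, 0.8)
print(round(expected, 4), round(prob_none, 4))
```

Because the undetected deposits form a Poisson process independent of the detected ones, these quantities depend on the data only through the fitted intensity.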

This argument is an instance of the prediction approach to survey sampling inference (Royall 1988). The difficulty is that the detection probability *P*(*u*) will depend on the detection method, the spatially-varying amount of survey effort, and other factors. If *P*(*u*) can be estimated from data, perhaps by comparing the results of successive surveys of the same region, then the form of (2.21) suggests that the appropriate model is a logistic regression for *P*(*u*) on explanatory variables. If no information is available about *P*(*u*), we could make the simplifying assumption that *P*(*u*) ≡ *P* is constant; then *ξ*(*u*) is a constant multiple of *λ*(*u*), so that at least the relative prospectivity of different locations *u* can be assessed from a plot of *λ*(*u*).

Other, non-Poisson point processes can also serve as models of mineral deposits (Baddeley et al. 2015, Chaps. 12 and 13) and support prediction and conditional simulation. In such models, the presence of a point affects the probability of presence of a point at nearby locations. In this case the conditional simulation does depend on the observed deposit locations (Møller and Waagepetersen 2004; Baddeley et al. 2015).

A more realistic model of the detection process would envisage that the discovery of a new deposit will encourage the exploration geologist to survey the surrounding areas more intensively, increasing the detection probability in these surrounding areas. This destroys the independence structure: the pattern of observed deposits is no longer a Poisson point process, and is spatially clustered. Non-Poisson point process models would be needed to describe the spatial pattern of observed deposits, and even if the spatial pattern of all deposits is assumed to be Poisson, the pattern of undiscovered deposits is both non-Poisson and dependent on the observed deposits. A full analysis of the prediction problem would require the deployment of Missing Data principles (Little and Rubin 2002).

In prospectivity analysis it may or may not be desirable to fit any explicit relationship between deposit abundance and predictors such as distance to the nearest fault. Often the objective is simply to select a distance threshold, so as to delimit the area which is considered highly prospective (high predicted intensity) for the mineral. The ROC curve (Sect. 2.7) is more relevant to this exercise. However, if credible models can be fitted, they contain much more valuable predictive information.

### **2.5 Monotone Regression**

The remainder of this article describes three alternative analysis techniques, genuinely different from logistic regression, which do not seem to be widely used in prospectivity analysis. These techniques are genuinely "non-parametric" in the sense that they assume only that the intensity or rate of mineral deposits *λ*(*u*) is a function of the predictor variable *Z*(*u*) at the same location *u*,

$$
\lambda(u) = \rho(Z(u))\tag{2.22}
$$

where *ρ*(*z*) is a function to be estimated. We do not assume that *ρ*(*z*) has any particular functional form.

The assumption (2.22) is encountered frequently. In geological applications where the points are the locations of mineral deposits, *ρ* is an index of the prospectivity (Bonham-Carter 1995) or predicted frequency of deposits as a function of geological and geochemical covariates *z*. In ecological applications where the points are the locations of individual organisms, *ρ* is a "resource selection function" (Manly et al. 1993) reflecting preference for particular environmental conditions *z*.

In *monotone regression*, we assume that *ρ*(*z*) is a monotone function of *z*, either monotone increasing (non-decreasing):

$$z\_1 < z\_2 \quad \text{implies} \quad \rho(z\_1) \le \rho(z\_2)$$

or monotone decreasing (non-increasing):

$$z\_1 < z\_2 \quad \text{implies} \quad \rho(z\_1) \ge \rho(z\_2).$$

Sager (1982) considered the log-likelihood of the Poisson point process with intensity (2.22),

$$\log L = \sum\_{i} \log \rho(Z(\mathbf{x}\_i)) - \int\_{W} \rho(Z(u)) \, \mathrm{d}u,\tag{2.23}$$

and showed that the log-likelihood can be maximised over the class of all monotone functions *ρ*. The optimal function *ρ̂*(*z*) is the *nonparametric maximum likelihood* estimate of *ρ*(*z*) under the monotonicity constraint, or simply the *monotone regression* estimate.

To simplify the discussion, assume that *ρ*(*z*) is monotone decreasing, and that the values of *Z*(*u*) are real numbers greater than or equal to zero. Sager (1982) showed that the monotone regression estimate *ρ̂*(*z*) is piecewise constant, with jumps occurring only at the observed values *z<sub>i</sub>* = *Z*(*x<sub>i</sub>*) of the predictor at the deposit point locations. For any *z* let

$$A(z) = |\{ u \in W \; ; \; Z(u) \le z \}|\tag{2.24}$$

be the area of the subset of the survey region where the covariate value is less than or equal to *z*. Also let *N*(*z*) = ∑<sub>*i*</sub> **1**{*z<sub>i</sub>* ≤ *z*} be the number of data points for which the covariate value is less than or equal to *z*. In the Murchison example, *A*(*z*) is the area lying closer than *z* kilometres from the nearest fault, and *N*(*z*) is the number of deposits lying in this region. Then the monotone regression estimate *ρ̂*(*z*) is the maximum of simple functions

$$\widehat{\rho}(z) = \max\_{i} \rho\_{i}(z) \tag{2.25}$$

where

$$\rho\_i(z) = \begin{cases} \frac{N(z\_i)}{A(z\_i)} & \text{if } z < z\_i \\ 0 & \text{if } z \ge z\_i. \end{cases} \tag{2.26}$$

The monotone regression estimate *̂*(*z*) can be computed rapidly using the Pool Adjacent Violators algorithm (Barlow et al. 1972) or the following Maximum Upper Sets algorithm (Sager 1982):


Figure 2.11 shows the monotone regression estimate of the intensity of gold deposits as a function of distance to nearest fault in the Murchison data. The curve has the same overall shape as the exponential curve implied by the loglinear Poisson point process model or logistic regression (Fig. 2.6), except for a prominent plateau between *z* = 2 and *z* = 6 km.
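The Pool Adjacent Violators computation mentioned above can be sketched as follows for a monotone decreasing intensity. This is an illustrative sketch rather than the code used for Fig. 2.11. It assumes that the deposits have distinct covariate values, sorted in increasing order, and that the cumulative areas *A*(*z<sub>i</sub>*) have already been computed. Each deposit is initially assigned to the band of the survey region between consecutive observed covariate values, and adjacent bands are pooled whenever the raw rates (deposits per unit area) fail to decrease. The function name and the example areas are hypothetical.

```python
def monotone_decreasing_intensity(cum_area):
    """Pool Adjacent Violators for a decreasing intensity.

    cum_area[i] is A(z_i), the area where the covariate is <= z_i, for deposits
    sorted by increasing covariate value (assumed strictly increasing areas).
    Returns the fitted intensity on each band between consecutive z_i.
    """
    blocks = []  # each block: [number of deposits, area, number of bands pooled]
    previous = 0.0
    for area in cum_area:
        blocks.append([1.0, area - previous, 1])
        previous = area
        # Pool while the newest block has a higher rate than its predecessor,
        # which would violate the "decreasing" constraint.
        while (len(blocks) > 1
               and blocks[-1][0] * blocks[-2][1] > blocks[-2][0] * blocks[-1][1]):
            count, block_area, bands = blocks.pop()
            blocks[-1][0] += count
            blocks[-1][1] += block_area
            blocks[-1][2] += bands
    rates = []
    for count, block_area, bands in blocks:
        rates.extend([count / block_area] * bands)
    return rates

# Hypothetical cumulative areas (km^2) at four deposits.
print(monotone_decreasing_intensity([4.0, 5.0, 6.0, 20.0]))
```

In this example the first three bands have raw rates 1∕4, 1 and 1, which are not decreasing, so they are pooled into a single band of area 6 containing 3 deposits.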

Note also that the monotone regression estimate of intensity at small distances *z* is higher (Fig. 2.11) than in the loglinear Poisson model (Fig. 2.6). This is expected, since the estimate *ρ̂*(*z*) depends primarily on the part of the survey where *Z*(*u*) ≤ *z*. This is more satisfactory than the behaviour of the loglinear Poisson model for which the fitted curve depends on the entire dataset. If, for example, we were to restrict the study area to the region lying at most 20 km from the nearest fault, the estimates of the parameters *α*, *β* in the loglinear Poisson model could change markedly, while the monotone regression in Fig. 2.11 would be unchanged.

Figure 2.12 shows a perspective plot of the predicted intensity implied by the monotone regression. Compared with Fig. 2.7, this shows qualitatively the same effect of a dense concentration close to the geological faults, but with a different profile (again fortuitously displayed at the southern edge of the plot).

Sager (1982) shows that this method extends to multiple predictor variables. The author believes it can also be extended to allow points to have weights determined by the mineral endowment of the deposit, or a similar characteristic.

**Fig. 2.12** Perspective view of fitted intensity using monotone regression

### **2.6 Nonparametric Curve Estimation**

A second alternative to logistic regression is nonparametric curve estimation, in which we assume that the intensity is a smooth function of the predictor, *λ*(*u*) = *ρ*(*Z*(*u*)), and estimate the function *ρ*(*z*) by nonparametric smoothing. This was developed in Baddeley et al. (2012), Guan (2008).

Assume that Eq. (2.22) holds, and that *ρ*(*z*) is a continuous function of *z*, and that *Z*(*u*) is at least a continuous function of location *u*, without further constraints. Nonparametric estimation of *ρ* is closely connected to estimation of a probability density from biased sample data (Jones 1991; El Barmi and Simonoff 2000) and to the estimation of relative densities (Handcock and Morris 1999). Under the smoothness assumptions, *ρ* is proportional to the ratio of two probability densities, the numerator being the density of covariate values at the points of the point process, while the denominator is the density of covariate values at random locations in space. Kernel smoothing can be used to estimate the function *ρ* as a relative density (Baddeley et al. 2012; Guan 2008).

Define the *spatial distribution function* (Lahiri 1999; Lahiri et al. 1999) as the cumulative distribution function of the covariate value *Z*(*U*) at a random point *U* uniformly distributed in *W*:

$$G(z) = \frac{1}{|W|} \int_W \mathbf{1}\{Z(u) \le z\} \,\mathrm{d}u. \tag{2.27}$$

Here we use the 'indicator' notation: **1**{…} equals 1 if the statement '…' is true, and 0 if the statement is false. Equivalently *G*(*z*) = *A*(*z*)∕*A*(∞) = *A*(*z*)∕|*W*|, where *A*(*z*), defined in (2.24), is the area of the set of all locations in *W* where the covariate value is less than or equal to *z*. In practice *G*(*z*) would often be estimated by evaluating the covariate at a fine grid of pixel locations, and forming the cumulative distribution function

$$G(z) = \frac{\#\{\text{pixels } u : Z(u) \le z\}}{\#\text{pixels}}. \tag{2.28}$$
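The pixel-based estimate in (2.28) can be sketched in a few lines. This is only an illustration in plain Python, not the spatstat implementation used for the chapter; the function name `spatial_cdf` and the eight pixel values are hypothetical.

```python
from bisect import bisect_right

def spatial_cdf(pixel_values):
    """Return G, where G(z) is the fraction of pixels with covariate value <= z."""
    ordered = sorted(pixel_values)
    n_pixels = len(ordered)

    def G(z):
        # bisect_right returns the number of sorted values that are <= z
        return bisect_right(ordered, z) / n_pixels

    return G

# Hypothetical raster: distance (km) to the nearest fault at 8 pixels.
pixel_distances = [0.5, 1.0, 1.0, 2.5, 4.0, 7.5, 12.0, 20.0]
G = spatial_cdf(pixel_distances)
```

With these made-up values, `G(3.0)` is the fraction of the survey area lying within 3 km of a fault.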

Three estimators of *ρ* proposed in Baddeley et al. (2012) are

$$\hat{\rho}_{R}(z) = \frac{1}{|W|\,G'(z)} \sum_{i} \kappa(Z(x_i) - z) \tag{2.29}$$

$$\hat{\rho}_{W}(z) = \sum_{i} \frac{1}{|W|\,G'(Z(x_i))} \kappa(Z(x_i) - z) \tag{2.30}$$

$$\hat{\rho}_{T}(z) = \frac{1}{|W|} \sum_{i} \kappa(G(Z(x_i)) - G(z)) \tag{2.31}$$

where *x*<sub>1</sub>, …, *x*<sub>*n*</sub> are the data points, *Z*(*x*<sub>*i*</sub>) are the observed values of the covariate *Z* at the data points, |*W*| is the area of the observation window *W*, and *κ* is a *one-dimensional* smoothing kernel: smoothing is conducted on the observed values *Z*(*x*<sub>*i*</sub>) rather than in the window *W*. The derivative *G*′(*z*) is usually approximated by differentiating a smoothed estimate of *G*. The estimators (2.29)–(2.31) were developed in Baddeley et al. (2012) by adapting estimators from kernel smoothing (Jones 1991; El Barmi and Simonoff 2000). An estimator similar to (2.29) was proposed in Guan (2008).
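As a rough illustration of the third estimator, (2.31), the sketch below evaluates the kernel sum on the transformed scale *G*(*Z*). It is a minimal sketch and not the chapter's software: the Gaussian kernel, the fixed bandwidth, the pixel-based *G* and the function names are all assumptions, and no edge correction is applied near *G* = 0 or *G* = 1.

```python
import math
from bisect import bisect_right

def gaussian_kernel(t, bandwidth):
    """One-dimensional Gaussian smoothing kernel (an assumed choice of kappa)."""
    return math.exp(-0.5 * (t / bandwidth) ** 2) / (bandwidth * math.sqrt(2.0 * math.pi))

def rho_hat_T(z, point_values, pixel_values, window_area, bandwidth=0.05):
    """Estimator (2.31): (1/|W|) * sum_i kappa(G(Z(x_i)) - G(z))."""
    ordered = sorted(pixel_values)

    def G(v):
        # pixel-based spatial distribution function, as in (2.28)
        return bisect_right(ordered, v) / len(ordered)

    gz = G(z)
    total = sum(gaussian_kernel(G(zi) - gz, bandwidth) for zi in point_values)
    return total / window_area

# Hypothetical check: 100 points whose covariate values are spread evenly over
# the pixel values, in a window of area 50, so the intensity is about 2 per unit area.
pixel_values = [k / 1000 for k in range(1, 1001)]
point_values = [(2 * j - 1) / 200 for j in range(1, 101)]
estimate = rho_hat_T(0.5, point_values, pixel_values, window_area=50.0)
```

In this artificial example the points are spread evenly relative to the area, so the estimate at the centre of the range should be close to the overall intensity *n*∕|*W*| = 2.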

Figure 2.13 shows the fitted estimate of intensity for the Murchison gold deposits as a function of distance to the nearest fault. The plot shows the ratio estimator *ρ̂*<sub>R</sub>(*z*) against *z*, equation (2.29), together with the pointwise 95% confidence interval for *ρ*(*z*) based on asymptotic theory assuming a Poisson process (Baddeley et al. 2012). Tickmarks on the horizontal axis show the observed distance values *z*<sub>*i*</sub> = *Z*(*x*<sub>*i*</sub>) at the deposits.

The overall shape of Fig. 2.13 is consistent with Figs. 2.6 and 2.11. A plateau of intensity is visible between *z* = 2.5 and *z* = 5.5 km, consistent with the plateau seen in Fig. 2.11. The peak of intensity in Fig. 2.13 occurs at about *z* = 1 km, rather than at distance *z* = 0, but this may be an artefact of the smoothing procedure, as it is not seen in the other two estimates (2.30) and (2.31).

Figure 2.14 shows a perspective view of the predicted intensity using nonparametric curve estimation. This is quite similar to the surface obtained by monotone regression, shown in Fig. 2.12.

The nonparametric curve estimate has the attractive property that *ρ̂*(*z*) depends only on the survey information from locations where the predictor value is approximately equal to *z*. In the Murchison example, the intensity *ρ̂*(*z*) of gold deposits at a distance *z* from the nearest fault is estimated using only the deposits and non-deposit locations which lie approximately *z* km from the nearest fault. Although smoothing artefacts may be present, this property means that the nonparametric curve estimate can be treated as an estimate of the true relationship between intensity and predictor.

**Fig. 2.13** Fitted intensity of gold deposits as a function of distance to the nearest fault, using kernel-based estimator

**Fig. 2.14** Perspective view of fitted intensity using nonparametric curve estimate of *ρ*

The estimators (2.29)–(2.31) can be modified to incorporate numerical weights, for example, representing the endowment of each deposit. Then *ρ*(*z*) has the interpretation of the expected *total endowment* per unit area, of deposits at a distance *z* from the nearest fault.

Figure 2.15 shows the three estimates of *ρ*(*z*) together. Logistic regression, loglinear Poisson point process modelling, and maximum entropy methods effectively assume that prospectivity is an exponential function *ρ*(*z*) = *AB*<sup>*z*</sup> = exp(*α* + *βz*), while monotone regression assumes *ρ*(*z*) is a decreasing function of *z*, and kernel estimation assumes *ρ*(*z*) is a smooth function of *z* without further restriction.

This analysis assumes that the intensity at a location *u* depends *only* on the covariate value *Z*(*u*). To validate the assumption (2.22) we can compare the predicted intensity *ρ̂*(*Z*(*u*)) assuming (2.22) with a (spatial) kernel estimate *λ̂*(*u*) which does not assume (2.22). If the assumption is not true, *ρ̂*(*z*) is still meaningful: it is effectively an estimate of the average intensity *λ*(*u*) over all locations *u* where *Z*(*u*) = *z*.

### **2.7 ROC Curves**

Suppose we agree that the ultimate goal of prospectivity analysis is to decide which parts of an exploration area are most likely to contain a valuable deposit. Then the essential task is to *classify* different parts of the exploration area into areas of high and low prospectivity, rather than necessarily needing to model the degree of prospectivity at every location.

The *Receiver Operating Characteristic (ROC)* curve (Krzanowski and Hand 2009) is a summary of the performance of a classifier. It is often applied to medical diagnostic tests (Nam and D'Agostino 2002) when the test is based on thresholding a quantitative assay. Suppose for example that a medical test returns a "positive" result (predicting a high risk of disease) if the patient's blood cholesterol level exceeds a threshold *t*. For a given choice of threshold *t*, the "true positive rate" TP(*t*) is the fraction of patients with the disease who return a correct, positive test result. The "false positive rate" FP(*t*) is the fraction of patients who do not have the disease but who return an incorrect, positive test result. The ROC curve is a plot of true positive rate TP(*t*) against false positive rate FP(*t*) for all thresholds *t*. A good classifier has a large true positive rate in comparison to the false positive rate, so the ROC curve of a good classifier will lie well above the diagonal line on the graph.

The same technique can be applied to prospectivity analysis (Rakshit et al. 2017), taking the mineral deposits as the "disease", and using either a spatial predictor *Z*(*u*) or a fitted model intensity *λ*(*u*) to classify pixels into high or low prospectivity classes. Suppose that *Z*(*u*) is a real-valued spatial predictor. Calculate the empirical cumulative distribution function of *Z* at the observed deposit locations,

$$\widehat{F}\_{\mathbf{x}}(t) = \frac{1}{n(\mathbf{x})} \sum\_{i} \mathbf{1}\{Z(\mathbf{x}\_i) \le t\}$$

and the empirical "spatial distribution function" of *Z*(*u*) over all locations *u* in *W*,

$$F_W(t) = \frac{1}{|W|} \int_W \mathbf{1}\{Z(u) \le t\} \,\mathrm{d}u.$$

Then the ROC plot is a graph of 1 − *F̂*<sub>**x**</sub>(*t*) against 1 − *F*<sub>*W*</sub>(*t*) for all *t*. Equivalently it is a plot of *R*<sub>+</sub>(*p*) = 1 − *F*<sub>**x**</sub>(*F*<sub>*W*</sub><sup>−1</sup>(1 − *p*)) against *p*. Applied statisticians would recognise this as a form of the classical P–P plot.

The formulae above assume that larger values of *Z* are more prospective. If smaller values of *Z* are more prospective, then the appropriate ROC plot is a graph of *F*<sub>**x**</sub>(*t*) against *F*<sub>*W*</sub>(*t*) for all *t*, or equivalently a graph of *R*<sub>−</sub>(*p*) = *F*<sub>**x**</sub>(*F*<sub>*W*</sub><sup>−1</sup>(*p*)) against *p*. This is the P–P plot of *F*<sub>**x**</sub> against *F*<sub>*W*</sub>.
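The P–P construction for the case where smaller values are more prospective can be sketched as follows. This is an illustrative sketch only: the spatial distribution function is approximated by pixel counts, and the function name `roc_points` and the data values are hypothetical.

```python
def roc_points(deposit_values, pixel_values):
    """ROC points (F_W(t), F_x(t)) for a predictor where SMALLER values are more prospective."""
    thresholds = sorted(set(pixel_values) | set(deposit_values))
    n_dep, n_pix = len(deposit_values), len(pixel_values)
    points = [(0.0, 0.0)]
    for t in thresholds:
        area_fraction = sum(v <= t for v in pixel_values) / n_pix       # F_W(t)
        deposit_fraction = sum(v <= t for v in deposit_values) / n_dep  # F_x(t)
        points.append((area_fraction, deposit_fraction))
    return points

# Hypothetical data: distance to nearest fault (km) at 10 pixels and at 5 deposits.
pixel_distances = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
deposit_distances = [1, 1, 2, 3, 8]
curve = roc_points(deposit_distances, pixel_distances)
```

Each point of `curve` pairs a fraction of the survey area with the fraction of deposits captured at the same distance threshold, which is how statements such as "60% of deposits in 20% of the area" are read off.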

Figure 2.16 shows the ROC curve for the Murchison gold deposits against distance from nearest fault, assuming smaller distances are more prospective. The horizontal axis shows the fraction of area in the survey region which lies less than *t* km away from a fault, and the vertical axis shows the fraction of deposits which lie less than *t* km from a fault. For example, we may read off the plot that 60% of all known deposits lie in a region occupying 10% of the survey area defined by a distance threshold. This has a useful practical interpretation. The threshold itself is not shown on the ROC plot but could be obtained from the spatial cumulative distribution function. The ROC curve depends on the choice of the study region (Jiménez-Valverde 2012).

The interpretation of ROC curves in spatial analysis is controversial. Some writers suggest (Fielding and Bell 1997) that the ROC can be used to evaluate the goodness-of-fit of a species distribution model, or equivalently the goodness-of-fit of a prospectivity analysis. Others disagree (Lobo et al. 2007, p. 146) and argue that the ROC is an indicator of predictive power, that is, the ability to segregate pixels reliably into two classes of high and low prospectivity.

The ROC can be based either on a real-valued predictor variable *Z*(*u*) or on a fitted model intensity *λ*(*u*). In the latter case it is tempting to regard the ROC as a summary of the predictive power of the fitted model (Lobo et al. 2007; Austin 2007; Thuiller et al. 2003). However, if the model is logistic regression, or a loglinear Poisson point process, or if the intensity is estimated using monotone regression, then the fitted intensity *λ*(*u*) is a monotone function of the predictor *Z*(*u*). Thresholding *λ*(*u*) is equivalent to thresholding *Z*(*u*), so that the ROC curves derived from any of these models are identical. For example, the ROC cannot be used to compare the predictive power of logistic regression against that of monotone regression. It would be more appropriate to regard the ROC as a summary of the inherent predictive power of the predictor variable *Z*(*u*) itself (Rakshit et al. 2017).
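The invariance argument can be checked numerically. The sketch below, with hypothetical data and hypothetical loglinear coefficients, computes the ROC once from the distance predictor and once from a fitted intensity exp(*α* + *βz*) with *β* < 0. The intensity is negated so that both scores follow the same "smaller is more prospective" convention.

```python
import math

def roc_points(scores_at_deposits, scores_at_pixels):
    """ROC points for a score where SMALLER values are classed as more prospective."""
    thresholds = sorted(set(scores_at_pixels) | set(scores_at_deposits))
    n_dep, n_pix = len(scores_at_deposits), len(scores_at_pixels)
    points = [(0.0, 0.0)]
    for t in thresholds:
        area_fraction = sum(v <= t for v in scores_at_pixels) / n_pix
        deposit_fraction = sum(v <= t for v in scores_at_deposits) / n_dep
        points.append((area_fraction, deposit_fraction))
    return points

def fitted_intensity(z, alpha=-1.0, beta=-0.3):
    """Loglinear intensity exp(alpha + beta*z); alpha and beta are made-up values."""
    return math.exp(alpha + beta * z)

# Hypothetical distances (km) to the nearest fault.
pixel_distances = [0.5, 1.0, 2.0, 3.5, 5.0, 8.0, 12.0, 20.0]
deposit_distances = [0.5, 1.0, 1.0, 3.5, 8.0]

roc_from_predictor = roc_points(deposit_distances, pixel_distances)
# High intensity means high prospectivity, so threshold on the negated intensity.
roc_from_model = roc_points([-fitted_intensity(z) for z in deposit_distances],
                            [-fitted_intensity(z) for z in pixel_distances])
```

Because the transformation from distance to negated intensity is strictly increasing, the two lists of ROC points coincide whatever values of *α* and *β* < 0 are used.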

The ROC does have a connection with the other techniques described in this article. Suppose that the point process intensity *λ*(*u*) depends on the predictor *Z*(*u*) through a function *ρ*(*z*) as in (2.22). Then we show in (Rakshit et al. 2017) that the slope of the ROC curve is closely related to *ρ*. If large values of the predictor are more prospective, the slope of the ROC curve is

$$\frac{\mathrm{d}}{\mathrm{d}p}R_{+}(p) = \frac{\mathrm{d}}{\mathrm{d}p}\left[1 - F_{\mathbf{x}}(F_W^{-1}(1-p))\right] = \frac{1}{\kappa}\,\rho(F_W^{-1}(1-p))$$

while if small values of the predictor are prospective, the slope is

$$\frac{\mathrm{d}}{\mathrm{d}p} R_{-}(p) = \frac{\mathrm{d}}{\mathrm{d}p} \left[ F_{\mathbf{x}} (F_{W}^{-1}(p)) \right] = \frac{1}{\kappa}\,\rho (F_{W}^{-1}(p)),$$

where *κ* is the average intensity over the study region (here *κ* does not denote the smoothing kernel of Sect. 2.6). Analysis using the ROC curve is not fundamentally different from fitting a point process model or pixel presence/absence regression model, but may be a more practically useful presentation of the same information.
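A heuristic sketch of why the slope takes this form is given below for the case where small values are prospective. It is not the proof in Rakshit et al. (2017). It assumes that *F*<sub>*W*</sub> is differentiable with positive derivative, and it reads *F*<sub>**x**</sub> as the expected-count version of the empirical distribution function, with *κ*|*W*| the expected number of points.

$$\begin{aligned} F_{\mathbf{x}}(z) &= \frac{1}{\kappa |W|} \int_W \rho(Z(u))\, \mathbf{1}\{Z(u) \le z\} \,\mathrm{d}u = \frac{1}{\kappa} \int_{-\infty}^{z} \rho(t) \,\mathrm{d}F_W(t), \\ F_{\mathbf{x}}'(z) &= \frac{1}{\kappa}\, \rho(z)\, F_W'(z), \\ \frac{\mathrm{d}}{\mathrm{d}p} F_{\mathbf{x}}(F_W^{-1}(p)) &= \frac{F_{\mathbf{x}}'(z)}{F_W'(z)} \bigg|_{z = F_W^{-1}(p)} = \frac{1}{\kappa}\, \rho(F_W^{-1}(p)). \end{aligned}$$

The last line uses the chain rule together with the derivative of the inverse function, d*F*<sub>*W*</sub><sup>−1</sup>(*p*)∕d*p* = 1∕*F*<sub>*W*</sub>′(*z*).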

### **2.8 Recursive Partitioning**

Classification and Regression Tree (CART) (Breiman et al. 1984) or Recursive Partitioning methods offer another alternative approach. Given one or many predictor variables, these methods predict the response by thresholding the predictors. The result is a prediction rule, organised as a logical tree, in which each fork of the tree is a threshold operation on one of the predictors. This kind of rule would appear to be well-suited to the practical needs of prospectivity analysis.

For a single predictor variable, the result of recursive partitioning is a piecewise-constant function *ρ̂*(*z*) which is not constrained to be monotone. Figure 2.17 shows the estimated intensity of the Murchison gold deposits as a function of distance to the nearest fault only, using recursive partitioning. Any number of predictor variables can be included in the analysis.

#### **Software and Data**

All analyses in this chapter were performed using the spatstat library (Baddeley et al. 2015), which is a contributed extension package for the R statistical software system (R Development Core Team 2011). Both R and spatstat can be downloaded from https://cran.r-project.org. The Murchison data are included in spatstat. Software scripts for the analyses in this chapter are available at www.spatstat.org.

**Acknowledgements** This research would not have been possible without enthusiastic support and collaboration of the Centre for Exploration Targeting (CET) at the University of Western Australia, in particular Professor Eun-Jung Holden. The research was funded by the Australian Research Council under a Discovery Outstanding Researcher Award.

This article includes summaries of the results of joint research with the Perth Spatial Point Processes Group (Ya-Mei Chang, Andrew Hardegen, Thomas Lawrence, Gopalan Nair, Suman Rakshit and Yong Song) and with collaborators Ege Rubak, Rob Foxall and Rolf Turner. Valuable advice was given by Professors Allan Trench and Mike Dentith.

### **References**


Wakefield J (2007) Disease mapping and spatial regression with count data. Biostatistics 8:158–183

Waller L, Gotway C (2004) Applied spatial statistics for public health data. Wiley



### **Chapter 3 Testing Joint Conditional Independence of Categorical Random Variables with a Standard Log-Likelihood Ratio Test**

### **Helmut Schaeben**

**Abstract** While tests for pairwise conditional independence of random variables have been devised, testing joint conditional independence of several random variables seems to be a challenge in general. Restriction to categorical random variables implies in particular that their common distribution may initially be thought of as a contingency table, and then in terms of a log-linear model. Thus, the Hammersley–Clifford theorem applies, and provides insight into the factorization of the log-linear model corresponding to assumptions of independence or conditional independence. Such assumptions simplify the full joint log-linear model, and in turn any conditional distribution. If the joint log-linear model corresponding to the assumption of joint conditional independence given the conditioning variable is not sufficiently large to explain some data according to a standard log-likelihood test, its null-hypothesis of joint conditional independence may be rejected with respect to some significance level. Enlarging the log-linear model by some product terms of variables and running the log-likelihood test on different models may provide insight into which variables are lacking conditional independence. Since the joint distribution determines any conditional distribution, the series of tests eventually provides insight into which variables and product terms a proper logistic regression model should comprise.

### **3.1 Introduction**

Conditional independence is a probabilistic approach to causality (Suppes 1970; Dawid 1979, 2004, 2007; Spohn 1980, 1994; Pearl 2009; Chalak and White 2012) while for instance correlation is obviously not as it is a symmetric relationship. Features of conditional independence are

© The Author(s) 2018

H. Schaeben (✉)

Geophysics and Geoinformatics, TU Bergakademie Freiberg, Freiberg, Germany e-mail: schaeben@tu-freiberg.de

B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_3


While statistical tests for pairwise conditional independence of random variables have been devised, e.g., Bergsma (2004), Su and White (2007), Su and White (2008), Song (2009), Bergsma (2010), Huang (2010), Zhang et al. (2011), Bouezmarni et al. (2012), Györfi and Walk (2012), Doran et al. (2014), Ramsey (2014), Huang et al. (2016), testing joint conditional independence of several random variables seems to be a challenge in general. For the special case of dichotomous variables, the "omnibus test" (Bonham-Carter 1994) and the "new omnibus test" (Agterberg and Cheng 2002) have been suggested.

Weak conditional independence of random variables was introduced in Wong and Butz (1999), and elaborated on in Butz and Sanscartier (2002). Extended conditional independence has recently been introduced in Constantinou and Dawid (2015). The definition of weak conditional independence given in Cheng (2015) refers to conditionally independent random events, and rephrases conditional independence in terms of ratios of conditional probabilities rather than conditional probabilities to avoid the distinction of conditional independence given a conditioning event or its complement. This definition becomes irrelevant when proceeding from elementary probability of events to probability of random variables, and to the general definition of conditionally independent random variables.

Conditional independence is an issue in a Bayesian approach to estimate posterior (conditional) probabilities of a dichotomous random target variable in terms of weights-of-evidence (Good 1950, 1960, 1985). In turn, conditional independence is the major mathematical assumption of potential modeling with weights of evidence, cf. (Bonham-Carter et al. 1989; Agterberg and Cheng 2002; Schaeben 2014b), e.g., applied to prospectivity modeling of mineral deposits. The method requires a training dataset laid out in regular cells (pixels, voxels) of equal physical size representing the support of probabilities. Under conditional independence, the sum of posterior probabilities over all cells equals the sum of the target variable over all cells. Deviations indicate a violation of the assumption of conditional independence, and are used as the statistic of a test (Agterberg and Cheng 2002) which involves a normality assumption. Oddly enough, ArcSDM calculates so-called normalized probabilities, i.e., posterior probabilities rescaled so that the overall measure of conditional independence is satisfied (ESRI 2018); of course, the trick does not fix any problem. Violation of the assumption of conditional independence corrupts not only the posterior (conditional) probabilities estimated with weights of evidence, but also their ranks, cf. (Schaeben 2014b), which is worse. Thus, the method of weights-of-evidence requires the mathematical modeling assumption of conditional independence to yield reasonable predictions. However, conditional independence is an issue with respect to logistic regression, too.

### **3.2 From Contingency Tables to Log-Linear Models**

A comprehensive exposition of log-linear models is given in Christensen (1997). Let **Z** be a random vector of categorical random variables *Z*<sub>ℓ</sub>, ℓ = 0, …, *m*, i.e., **Z** = (*Z*<sub>0</sub>, *Z*<sub>1</sub>, …, *Z*<sub>*m*</sub>). It is completely characterized by its distribution

$$p\_{\kappa} = P\_{\mathbf{Z}}(\mathbf{s}\_{\kappa}) = P(\mathbf{Z} = \mathbf{s}\_{\kappa}) = P\left( (\mathbf{Z}\_0, \dots, \mathbf{Z}\_m) = (\mathbf{s}\_{k\_0}, \dots, \mathbf{s}\_{k\_m}) \right),$$

with the multi-index *κ* = (*k*<sub>0</sub>, …, *k*<sub>*m*</sub>), where *s*<sub>*k*<sub>ℓ</sub></sub> with *k*<sub>ℓ</sub> = 1, …, *K*<sub>ℓ</sub> denotes all possible categories of the categorical random variable *Z*<sub>ℓ</sub>, ℓ = 0, …, *m*. Since it is assumed that there is a total of *K*<sub>ℓ</sub> different categories with *P*<sub>*Z*<sub>ℓ</sub></sub>(*s*<sub>*k*<sub>ℓ</sub></sub>) > 0, there is a total of ∏<sub>ℓ=0</sub><sup>*m*</sup> *K*<sub>ℓ</sub> different categorical states for **Z** = ⨂<sub>ℓ=0</sub><sup>*m*</sup> *Z*<sub>ℓ</sub>.

The distribution of a categorical random vector may initially be thought of as being provided by contingency tables. More conveniently, the distribution of a categorical random vector *Z* can generally be written in terms of a log-linear model as

$$\log P_{\mathbf{Z}}(\mathbf{z}) = \sum_{\kappa} w_{\kappa} \, f_{\mathbf{Z}}^{\kappa}(\mathbf{z})$$

with

$$\begin{aligned} w_{\kappa} &= \log p_{\kappa}, \\ f_{\mathbf{Z}}^{\kappa}(\mathbf{z}) &= \mathbf{1}_{\{\mathbf{s}_{\kappa}\}}(\mathbf{z}) = \mathbf{1}_{\{(s_{k_{0}}, \dots, s_{k_{m}})\}}(z_{0}, \dots, z_{m}). \end{aligned}$$

### **3.3 Independence, Conditional Independence of Random Variables**

If the random variables *Z*<sub>ℓ</sub>, ℓ = 1, …, *m*, are independent, then the joint probability of any subset of random variables can be factorized into the product of the individual probabilities, i.e.,

$$P\_{\bigotimes\_{\ell \in \mathcal{M}} Z\_{\ell}} = \bigotimes\_{\ell \in \mathcal{M}} P\_{Z\_{\ell}}.$$

where *M* denotes any non-empty subset of the set {1, …, *m*}. In particular

$$P\_Z = P\_{\bigotimes\_{\ell=1}^m Z\_{\ell}} = \bigotimes\_{\ell=1}^m P\_{Z\_{\ell}}.$$

If the random variables *Z*<sub>ℓ</sub>, ℓ = 1, …, *m*, are conditionally independent given *Z*<sub>0</sub>, then the joint conditional probability of any subset of random variables given *Z*<sub>0</sub> can be factorized into the product of the individual conditional probabilities, i.e.,


$$P_{\bigotimes_{\ell \in M} Z_{\ell} \mid Z_0} = \bigotimes_{\ell \in M} P_{Z_{\ell} \mid Z_0},\tag{3.1}$$

and in particular

$$P_{\bigotimes_{\ell=1}^{m} Z_{\ell} \mid Z_0} = \bigotimes_{\ell=1}^{m} P_{Z_{\ell} \mid Z_0}.$$
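For a small joint distribution stored as a table, the factorization in the last display can be checked directly. The following is a minimal sketch; the function name, the tolerance and the example probabilities are hypothetical, and the table is assumed to give positive probability to every category of the conditioning variable.

```python
from itertools import product

def is_jointly_conditionally_independent(joint, tol=1e-12):
    """joint maps states (z0, z1, ..., zm) to probabilities; z0 is the conditioning variable."""
    m = len(next(iter(joint))) - 1
    levels = [sorted({state[l] for state in joint}) for l in range(m + 1)]
    for z0 in levels[0]:
        p_z0 = sum(p for state, p in joint.items() if state[0] == z0)
        for rest in product(*levels[1:]):
            # left-hand side: joint conditional probability given Z0 = z0
            lhs = joint.get((z0,) + rest, 0.0) / p_z0
            # right-hand side: product of the individual conditional probabilities
            rhs = 1.0
            for l, value in enumerate(rest, start=1):
                p_pair = sum(p for state, p in joint.items()
                             if state[0] == z0 and state[l] == value)
                rhs *= p_pair / p_z0
            if abs(lhs - rhs) > tol:
                return False
    return True

def bern(p, z):
    return p if z == 1 else 1.0 - p

# Hypothetical dichotomous example: P(Z0 = z0) and P(Z_l = 1 | Z0 = z0).
prior = {0: 0.8, 1: 0.2}
p1 = {0: 0.1, 1: 0.7}
p2 = {0: 0.3, 1: 0.6}
ci_joint = {(z0, z1, z2): prior[z0] * bern(p1[z0], z1) * bern(p2[z0], z2)
            for z0, z1, z2 in product((0, 1), repeat=3)}

# Shift mass within the Z0 = 1 layer; this keeps each P(Z_l | Z0) fixed
# but introduces dependence between Z1 and Z2 given Z0 = 1.
dependent_joint = dict(ci_joint)
delta = 0.02
dependent_joint[(1, 1, 1)] += delta
dependent_joint[(1, 0, 0)] += delta
dependent_joint[(1, 1, 0)] -= delta
dependent_joint[(1, 0, 1)] -= delta
```

Only the factorization over all *m* predictors is checked here; for a categorical table the factorization for any subset *M* follows from it by summing over the remaining variables.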

### **3.4 Logistic Regression, and Its Special Case of Weights-of-Evidence**

Conditional expectation of a dichotomous random target variable *Z*<sub>0</sub> given an *m*-variate random predictor vector **Z** = (*Z*<sub>1</sub>, …, *Z*<sub>*m*</sub>) is equal to a conditional probability, i.e.,

$$\operatorname{E}(Z_0 \mid \mathbf{Z}) = P(Z_0 = 1 \mid \mathbf{Z}).$$

Then the ordinary logistic regression model (without interaction terms) neglecting the error term yields

$$\text{logit}\,P(Z_0 = 1 \mid \mathbf{Z}) = \beta_0 + \boldsymbol{\beta}^{\mathsf{T}}\mathbf{Z}, \quad \beta_0 \in \mathbb{R},\ \boldsymbol{\beta} \in \mathbb{R}^m.$$

Omitting the error term it can be rewritten in terms of a probability as

$$P\left(Z_0 = 1 \mid \mathbf{Z}\right) = \Lambda\left(\beta_0 + \boldsymbol{\beta}^{\mathsf{T}} \mathbf{Z}\right),$$

where *Λ* denotes the logistic function. The logistic regression model with interaction terms reads in terms of a logit transformed probability

$$\text{logit}\,P(Z_0 = 1 \mid \mathbf{Z}) = \beta_0 + \sum_{\ell} \beta_\ell Z_\ell + \sum_{\ell_i, \dots, \ell_j} \beta_{\ell_i, \dots, \ell_j} Z_{\ell_i} \cdots Z_{\ell_j},\tag{3.2}$$

and in terms of a probability

$$P\left(Z_0 = 1 \mid \mathbf{Z}\right) = \Lambda \left(\beta_0 + \sum_{\ell} \beta_{\ell} Z_{\ell} + \sum_{\ell_i, \dots, \ell_j} \beta_{\ell_i, \dots, \ell_j} Z_{\ell_i} \cdots Z_{\ell_j}\right).$$

If all predictor variables are dichotomous variables and conditionally independent given the target variable then the parameters of the ordinary logistic regression model simplify to

$$\beta_0 = \text{logit}\,P(Z_0 = 1) + W^{(0)}, \quad \beta_{\ell} = C_{\ell}, \ \ \ell = 1, \dots, m,$$


with contrasts

$$C_{\ell} = W_{\ell}^{(1)} - W_{\ell}^{(0)}, \ \ \ell = 1, \dots, m,$$

defined as differences of weights of evidence

$$W_{\ell}^{(1)} = \ln \frac{P(Z_{\ell} = 1 \mid Z_{0} = 1)}{P(Z_{\ell} = 1 \mid Z_{0} = 0)}, \quad W_{\ell}^{(0)} = \ln \frac{P(Z_{\ell} = 0 \mid Z_{0} = 1)}{P(Z_{\ell} = 0 \mid Z_{0} = 0)},$$

and with *W*<sup>(0)</sup> = ∑<sub>ℓ=1</sub><sup>*m*</sup> *W*<sub>ℓ</sub><sup>(0)</sup>, provided all conditional probabilities are different from 0 (Schaeben 2014b). Obviously the model parameters become independent of one another, and can be estimated by mere counting. This special case of a logistic regression model is usually referred to as the method of "weights-of-evidence". In turn, the canonical generalization of Bayesian weights-of-evidence is logistic regression.

That the weights of evidence *W*<sub>ℓ</sub> agree with the logistic regression parameters *β*<sub>ℓ</sub> in case of joint conditional independence becomes obvious when recalling

$$\begin{split} C_{\ell} &= W_{\ell}^{(1)} - W_{\ell}^{(0)} \\ &= \ln \frac{P(Z_{\ell} = 1 \mid Z_0 = 1)}{P(Z_{\ell} = 1 \mid Z_0 = 0)} - \ln \frac{P(Z_{\ell} = 0 \mid Z_0 = 1)}{P(Z_{\ell} = 0 \mid Z_0 = 0)} \\ &= \ln \left( \frac{O(Z_0 = 1 \mid Z_{\ell} = 1)}{O(Z_0 = 1 \mid Z_{\ell} = 0)} \right) = \beta_{\ell}, \end{split}$$

which is the log odds ratio, the usual interpretation of *β*<sub>ℓ</sub> (Hosmer and Lemeshow 2000).
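The agreement between weights of evidence and logistic regression under joint conditional independence can be illustrated numerically. The sketch below uses made-up probabilities for two dichotomous predictors. It computes the weights, the contrasts and *β*<sub>0</sub> from the formulas above, and compares the resulting logistic posterior with the posterior obtained directly from Bayes' theorem. All names and numbers are hypothetical.

```python
import math
from itertools import product

def logit(p):
    return math.log(p / (1.0 - p))

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

prior = 0.2                                       # hypothetical P(Z0 = 1)
p_given = [{1: 0.7, 0: 0.1}, {1: 0.6, 0: 0.3}]    # hypothetical P(Z_l = 1 | Z0 = z0)

def weights(p):
    """Return (W^(1), W^(0)) for one predictor."""
    w1 = math.log(p[1] / p[0])
    w0 = math.log((1.0 - p[1]) / (1.0 - p[0]))
    return w1, w0

W = [weights(p) for p in p_given]
contrasts = [w1 - w0 for w1, w0 in W]             # C_l = W_l^(1) - W_l^(0)
beta0 = logit(prior) + sum(w0 for _, w0 in W)     # beta_0 = logit P(Z0 = 1) + W^(0)

def posterior_woe(z):
    """Posterior from the logistic form with beta_l = C_l."""
    return logistic(beta0 + sum(c * zl for c, zl in zip(contrasts, z)))

def posterior_bayes(z):
    """Posterior from Bayes' theorem, assuming conditional independence given Z0."""
    like1, like0 = prior, 1.0 - prior
    for p, zl in zip(p_given, z):
        like1 *= p[1] if zl == 1 else 1.0 - p[1]
        like0 *= p[0] if zl == 1 else 1.0 - p[0]
    return like1 / (like1 + like0)

max_gap = max(abs(posterior_woe(z) - posterior_bayes(z))
              for z in product((0, 1), repeat=2))
```

In this constructed example the two posteriors coincide up to rounding for every predictor pattern, because the joint distribution was built to satisfy conditional independence. With real data the equality holds only to the extent that the assumption does.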

If **Z** comprises *m* dichotomous predictor variables *Z*<sub>ℓ</sub>, ℓ = 1, …, *m*, there are 2<sup>*m*</sup> possible different realizations ***z***<sub>*k*</sub>, *k* = 1, …, 2<sup>*m*</sup>, of **Z**. Then

$$\begin{split} \sum_{i=1}^{n} \widehat{P}(Z_{0} = 1 \mid \mathbf{Z} = \mathbf{z}(i)) &= \sum_{k=1}^{2^{m}} \widehat{P}(Z_{0} = 1 \mid \mathbf{Z} = \mathbf{z}_{k}) \, H(\mathbf{Z} = \mathbf{z}_{k}) \\ &= \sum_{k=1}^{2^{m}} \widehat{P}(Z_{0} = 1 \mid \mathbf{Z} = \mathbf{z}_{k}) \, n \, \widehat{P}(\mathbf{Z} = \mathbf{z}_{k}) \\ &= n \widehat{P}(Z_{0} = 1) = \sum_{i=1}^{n} z_{0}(i), \end{split}$$

where the last equation is an application of the formula of total probability. It is a constitutive equation to estimate the parameters of a logistic regression model and holds always for fitted logistic regression models. With respect to weights-of-evidence, the test statistic of the so-called "new omnibus test" of conditional independence (Agterberg and Cheng 2002) is

$$\boldsymbol{\mu} = \sum\_{i=1}^{n} \left( \widehat{\boldsymbol{P}} \left( \mathbf{Z}\_0 = 1 \mid \mathbf{Z} = \mathbf{z} \,(i) \right) - z\_0(i) \right)^2$$

and should not be too large for conditional independence to be reasonably assumed.

### **3.5 Hammersley–Clifford Theorem**

Rephrasing the proper statement (Lauritzen 1996) casually, the Hammersley–Clifford theorem states that a probability distribution with a positive density satisfies one of the Markov properties with respect to an undirected graph *G* if and only if its density can be factorized over the cliques of the graph. Since the distribution of a categorical random vector can be represented in terms of a log-linear model, the Hammersley–Clifford theorem applies. Given (*m* + 1) random variables *Z*<sub>0</sub>, …, *Z*<sub>*m*</sub>, there is a total of $\binom{m+1}{\ell+1}$ different product terms each involving (ℓ + 1) variables, ℓ = 0, …, *m*, summing to a total of $\sum_{\ell=0}^{m} \binom{m+1}{\ell+1} = 2^{m+1} - 1$ different terms. Thus there is a total of (*m* + 1) single variable terms, and a total of 2<sup>*m*+1</sup> − (*m* + 2) multi variable terms.
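The term counts can be verified for any small *m* with a few lines of code. The helper name `count_terms` is hypothetical.

```python
from math import comb

def count_terms(m):
    """Count product terms of the full log-linear model for m + 1 variables."""
    # number of terms involving exactly (l + 1) variables, l = 0, ..., m
    per_order = {l + 1: comb(m + 1, l + 1) for l in range(m + 1)}
    total = sum(per_order.values())
    single = per_order[1]
    return per_order, total, single, total - single

per_order, total, single, multi = count_terms(3)
```

For *m* = 3, that is, four variables, this gives 4 single variable terms and 11 multi variable terms, 15 in total.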

The full log-linear model encompasses all terms and reads

$$\log p_{\kappa} = \sum_{\ell=0}^{m} \sum_{a \in C_{\ell+1}^{m+1}} \sum_{\kappa(a)} \phi_{\kappa(a)} \, \mathbf{1}_{\{s_{\kappa(a)}\}} (\mathbf{z}_{\kappa(a)}), \tag{3.3}$$

where *a* ∈ $C_{\ell+1}^{m+1}$ denotes an (ℓ + 1)-combination of the set {1, …, *m* + 1} ⊂ ℕ, and *κ*(*a*) = (*k*<sub>*i*<sub>1</sub></sub>, …, *k*<sub>*i*<sub>ℓ+1</sub></sub>) denotes a multi-index with (ℓ + 1) entries *k*<sub>*i*</sub> = 1, …, *K*<sub>*i*</sub>, for ℓ = 0, …, *m*. The random vector **Z**<sub>*κ*(*a*)</sub> is the product of any tuple of (ℓ + 1) components of **Z**, the total number of which is $\binom{m+1}{\ell+1}$.

Assumptions of independence or conditional independence simplify the distribution of **Z**, i.e., its full log-linear model, considerably. Assuming independence for all its components *Z*<sub>ℓ</sub>, ℓ = 0, …, *m*, the log-linear model simplifies according to Eq. (3.1) to

$$\log p_{\kappa} = \sum_{\ell=0}^{m} \log p_{k_{\ell}} = \sum_{\ell=0}^{m} \sum_{k_{\ell}=1}^{K_{\ell}} \phi_{k_{\ell}} \mathbf{1}_{\{s_{k_{\ell}}\}}(z_{\ell}),\tag{3.4}$$

where *φ*<sub>*k*<sub>ℓ</sub></sub> = log *p*<sub>*k*<sub>ℓ</sub></sub>.

Assuming joint conditional independence of all components *Z*<sub>ℓ</sub>, ℓ = 1, …, *m*, given *Z*<sub>0</sub>, the log-linear model, Eq. (3.3), simplifies according to Eq. (3.1) to

$$\log p_{\kappa} = \sum_{\ell=0}^{m} \sum_{k_{\ell}=1}^{K_{\ell}} \phi_{k_{\ell}} \mathbf{1}_{\{s_{k_{\ell}}\}}(z_{\ell}) + \sum_{\ell=1}^{m} \sum_{a \in \{0, \ell\}} \sum_{\kappa(a)} \phi_{\kappa(a)} \mathbf{1}_{\{s_{\kappa(a)}\}}(\mathbf{z}_{\kappa(a)}).\tag{3.5}$$

Thus the latter model, Eq. (3.5), assuming conditional independence differs from the model for independence, Eq. (3.4), in the additional product terms *Z*<sub>0</sub> ⊗ *Z*<sub>ℓ</sub>, ℓ = 1, …, *m*.

Any violation of joint conditional independence given *Z*<sub>0</sub> results in additional cliques of the graph and in additional product terms. Assuming that conditional independence given *Z*<sub>0</sub> does not hold for a particular subset *Z*<sub>1</sub>, …, *Z*<sub>*k*</sub> of variables results in an enlargement of the log-linear model of Eq. (3.5) by additional terms referring to $Z_0 \otimes \bigotimes_{\ell=1}^{k} \bigotimes_{i \in C_{\ell}^{k}} Z_i$ and $\bigotimes_{\ell=1}^{k} \bigotimes_{i \in C_{\ell}^{k}} Z_i$, respectively.

### **3.6 Testing Joint Conditional Independence of Categorical Random Variables**

The statistic of the likelihood ratio test (Neyman and Pearson 1933; Casella and Berger 2001) is the ratio of the maximized likelihood of a restricted model and the maximized likelihood of the full model. The assumption of the likelihood ratio test concerns the choice of the model family of distributions.

The null-hypothesis is that a given log-linear model is sufficiently large to represent the joint distribution. If the random variables are categorical, the full log-linear model is always sufficiently large as was explicitly shown above. More interesting are tests whether a smaller log-linear model is sufficiently large. Testing the null-hypothesis whether a log-linear model encompassing only the one-variable terms and those two-variable terms that involve *Z*<sub>0</sub> is sufficiently large provides a test of conditional independence of all *Z*<sub>ℓ</sub>, ℓ = 1, …, *m*, given *Z*<sub>0</sub>, because this log-linear model is sufficiently large in case of conditional independence given *Z*<sub>0</sub>. Thus, a reasonable rejection of the initial null-hypothesis implies a reasonable rejection of the assumption of conditional independence given *Z*<sub>0</sub>.
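For the smallest non-trivial case, one dichotomous target and two dichotomous predictors, the test can be sketched without any statistical library. The sketch below is an illustration under stated assumptions rather than the procedure used in the chapter. It uses the closed-form fitted counts of the conditional independence model, the deviance *G*² = 2 ∑ observed · ln(observed∕fitted), and the asymptotic chi-square reference distribution with 2 degrees of freedom, whose upper tail probability is exp(−*G*²∕2). The function name and both count tables are hypothetical.

```python
import math

def g2_conditional_independence(counts):
    """Likelihood ratio test of Z1 independent of Z2 given Z0 in a 2 x 2 x 2 table.

    counts maps (z0, z1, z2) to the observed cell count.
    Returns (G2, degrees of freedom, asymptotic p-value).
    """
    levels = (0, 1)
    g2 = 0.0
    for z0 in levels:
        n_z0 = sum(counts[(z0, a, b)] for a in levels for b in levels)
        for z1 in levels:
            n_z0_z1 = sum(counts[(z0, z1, b)] for b in levels)
            for z2 in levels:
                n_z0_z2 = sum(counts[(z0, a, z2)] for a in levels)
                # fitted count under the model with terms Z0*Z1 and Z0*Z2 only
                fitted = n_z0_z1 * n_z0_z2 / n_z0
                observed = counts[(z0, z1, z2)]
                if observed > 0:
                    g2 += 2.0 * observed * math.log(observed / fitted)
    df = 2                               # 8 cells minus 6 model parameters
    p_value = math.exp(-g2 / 2.0)        # chi-square upper tail for df = 2
    return g2, df, p_value

# Hypothetical table that satisfies conditional independence exactly.
ci_counts = {(0, 0, 0): 30, (0, 0, 1): 30, (0, 1, 0): 10, (0, 1, 1): 10,
             (1, 0, 0): 2, (1, 0, 1): 6, (1, 1, 0): 3, (1, 1, 1): 9}
# Hypothetical table with strong dependence between Z1 and Z2 given Z0.
dep_counts = {(0, 0, 0): 40, (0, 0, 1): 10, (0, 1, 0): 10, (0, 1, 1): 40,
              (1, 0, 0): 40, (1, 0, 1): 10, (1, 1, 0): 10, (1, 1, 1): 40}

g2_ci, df_ci, p_ci = g2_conditional_independence(ci_counts)
g2_dep, df_dep, p_dep = g2_conditional_independence(dep_counts)
```

For the first table the deviance is zero, so the null-hypothesis is not rejected. For the second it far exceeds the 5% critical value of about 5.99 for 2 degrees of freedom. With more predictors or more categories the fitted counts generally require iterative fitting and a general chi-square tail function.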

### **3.7 Conditional Distribution, Logistic Regression**

Since the joint distribution implies all marginal and conditional distributions, the conditional distribution

$$P_{Z_0 \mid \bigotimes_{\ell=1}^{m} Z_{\ell}} = \frac{P_{\bigotimes_{\ell=0}^{m} Z_{\ell}}}{P_{\bigotimes_{\ell=1}^{m} Z_{\ell}}} \tag{3.6}$$

is explicitly given here by

$$\frac{P_{\bigotimes_{\ell=0}^{m} Z_{\ell}}(s_{k_{0}},\ldots,s_{k_{m}})}{P_{\bigotimes_{\ell=1}^{m} Z_{\ell}}(s_{k_{1}},\ldots,s_{k_{m}})} = \frac{P_{\bigotimes_{\ell=0}^{m} Z_{\ell}}(s_{k_{0}},\ldots,s_{k_{m}})}{\sum_{k_{0}=1}^{K_{0}} P_{\bigotimes_{\ell=0}^{m} Z_{\ell}}(s_{k_{0}},s_{k_{1}},\ldots,s_{k_{m}})}.$$

Assuming independence, Eq. (3.6) immediately reveals

$$P_{Z_0 \mid \bigotimes_{\ell=1}^{m} Z_{\ell}} = P_{Z_0}.$$

Assuming conditional independence of all *,* = 1*,*…*, <sup>m</sup>*, given 0 and further that 0 is dichotomous, then

$$P\_{\mathbb{Z}\_0|\bigotimes\_{\ell=1}^m \mathbb{Z}\_{\ell}}(1 \mid s\_{k\_1}, \dots, s\_{k\_{\ell}}) = \frac{P\_{\bigotimes\_{\ell=0}^m \mathbb{Z}\_{\ell}}(1, s\_{k\_1}, \dots, s\_{k\_{\ell}})}{\sum\_{i=0}^1 P\_{\bigotimes\_{\ell=0}^m \mathbb{Z}\_{\ell}}(i, s\_{k\_1}, \dots, s\_{k\_{\ell}})} \tag{3.7}$$

with

$$P_{\bigotimes_{\ell=0}^{m} \mathbb{Z}_{\ell}}(1, s_{k_1}, \dots, s_{k_m}) = \exp\left(\phi_1 + \sum_{\ell=1}^{m} \phi_{k_{\ell}} + \sum_{\ell=1}^{m} \sum_{k_{\ell}=1}^{K_{\ell}} \phi_{1, k_{\ell}}\right),$$

and

$$\sum_{i=0}^{1} P_{\bigotimes_{\ell=0}^{m} \mathbb{Z}_{\ell}}(i, s_{k_1}, \dots, s_{k_m}) = \sum_{i=0}^{1} \exp\left(\phi_i + \sum_{\ell=1}^{m} \phi_{k_{\ell}} + \sum_{\ell=1}^{m} \sum_{k_{\ell}=1}^{K_{\ell}} \phi_{i, k_{\ell}}\right).$$

Thus,

$$\begin{split} & \frac{P_{\bigotimes_{\ell=0}^{m}\mathbb{Z}_{\ell}}(1, s_{k_{1}}, \dots, s_{k_{m}})}{\sum_{i=0}^{1}P_{\bigotimes_{\ell=0}^{m}\mathbb{Z}_{\ell}}(i, s_{k_{1}}, \dots, s_{k_{m}})} \\ & \quad = \frac{\exp\left(\phi_{1}\,\mathbbm{1}_{\{1\}}(1) + \sum_{\ell=1}^{m}\phi_{1,k_{\ell}}\,\mathbbm{1}_{\{1,s_{k_{\ell}}\}}(1, \mathbb{Z}_{\ell})\right)}{\sum_{i=0}^{1}\exp\left(\phi_{i}\,\mathbbm{1}_{\{i\}}(1) + \sum_{\ell=1}^{m}\phi_{i,k_{\ell}}\,\mathbbm{1}_{\{i,s_{k_{\ell}}\}}(1, \mathbb{Z}_{\ell})\right)} \\ & \quad = \frac{\exp\left(\phi_{1} + \sum_{\ell=1}^{m}\phi_{1,k_{\ell}}\,\mathbbm{1}_{\{s_{k_{\ell}}\}}(\mathbb{Z}_{\ell})\right)}{1 + \exp\left(\phi_{1} + \sum_{\ell=1}^{m}\phi_{1,k_{\ell}}\,\mathbbm{1}_{\{s_{k_{\ell}}\}}(\mathbb{Z}_{\ell})\right)} \\ & \quad = A\Big(\phi_{1} + \sum_{\ell=1}^{m}\phi_{1,k_{\ell}}\,\mathbbm{1}_{\{s_{k_{\ell}}\}}(\mathbb{Z}_{\ell})\Big). \end{split}$$

Finally,

$$P_{\mathbb{Z}_0 \mid \bigotimes_{\ell=1}^{m} \mathbb{Z}_{\ell}} = A\left(\beta_0 + \sum_{\ell=1}^{m} \beta_{\ell} \mathbb{Z}_{\ell}\right),$$

which is logistic regression

$$\operatorname{logit} P_{\mathbb{Z}_0 \mid \bigotimes_{\ell=1}^{m} \mathbb{Z}_\ell} = \beta_0 + \sum_{\ell=1}^{m} \beta_{\ell} \mathbb{Z}_{\ell}. \tag{3.8}$$

It should be noted that additional product terms in the joint probability $P_{\bigotimes_{\ell=0}^{m} \mathbb{Z}_\ell}$ on the right hand side of Eq. (3.7) of the form $\bigotimes_{k=1} \bigotimes_{i \in C_k} \mathbb{Z}_i$ including $\mathbb{Z}_\ell$, $\ell = 1, \ldots, m$, only, i.e., not including $\mathbb{Z}_0$, would not affect the form of the conditional probability, Eq. (3.8). Additional product terms of the form $\mathbb{Z}_0 \otimes \bigotimes_{k=1} \bigotimes_{i \in C_k} \mathbb{Z}_i$, i.e., including $\mathbb{Z}_0$, result in a logistic regression model with interaction terms, Eq. (3.2).

Ordinary logistic regression is optimum if the joint probability of the (dichotomous) target variable and the predictor variables is of log-linear form and all predictor variables are jointly conditionally independent given the target variable; in particular, it is optimum if the predictor variables are categorical and jointly conditionally independent given the target variable (Schaeben 2014a). Logistic regression with interaction terms is optimum if the joint probability of the (dichotomous) target variable and the predictor variables is of log-linear form and the interaction terms correspond to the lack of conditional independence given the target variable; for categorical predictor variables, interaction terms can compensate exactly for any lack of conditional independence (Schaeben 2014a).
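The first statement can be checked numerically for binary predictors. The sketch below is a minimal illustration only: the probabilities `prior`, `p1` and `p0` are hypothetical values chosen for this example, not taken from the chapter. It builds a joint distribution in which the predictors are conditionally independent given the target, computes the conditional probability of the target as in Eq. (3.7), and compares it with a logistic function of a linear predictor without interaction terms.

```python
import math
from itertools import product


def logistic(t):
    """Logistic function, the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-t))


# Hypothetical model: dichotomous target Z0 and three binary predictors.
prior = 0.2                 # P(Z0 = 1)
p1 = [0.7, 0.4, 0.9]        # P(Z_l = 1 | Z0 = 1)
p0 = [0.3, 0.5, 0.2]        # P(Z_l = 1 | Z0 = 0)


def joint(z0, z):
    """Joint probability under conditional independence given Z0."""
    prob = prior if z0 == 1 else 1.0 - prior
    cond = p1 if z0 == 1 else p0
    for zl, c in zip(z, cond):
        prob *= c if zl == 1 else 1.0 - c
    return prob


# Regression coefficients implied by the conditional probabilities.
beta = [math.log(a / (1 - a)) - math.log(b / (1 - b)) for a, b in zip(p1, p0)]
beta0 = math.log(prior / (1 - prior)) + sum(
    math.log((1 - a) / (1 - b)) for a, b in zip(p1, p0)
)

# Largest difference between the exact conditional probability and the
# logistic regression model over all predictor patterns.
max_gap = 0.0
for z in product((0, 1), repeat=len(p1)):
    exact = joint(1, z) / (joint(0, z) + joint(1, z))
    model = logistic(beta0 + sum(b * zl for b, zl in zip(beta, z)))
    max_gap = max(max_gap, abs(exact - model))

print(max_gap)
```

The gap is zero up to floating-point error, because under conditional independence the log-odds of the target are a sum of one contribution per predictor.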

### **3.8 Practical Applications**

The practical application of the log-likelihood ratio test of joint conditional independence generally includes the following steps


### *3.8.1 Practical Application with Fabricated Indicator Data*

#### **3.8.1.1 The Data Set** BRY

The data set BRY is derived from https://en.wikipedia.org/wiki/Conditional_independence. Initially it comprises three random events *B*, *R*, *Y*, denoting the subsets of the set of all 49 pixels which are blue, red or yellow, with given probabilities $P(B) = \frac{18}{49} = 0.367$, $P(R) = \frac{16}{49} = 0.326$, $P(Y) = \frac{12}{49} = 0.244$.

**Fig. 3.1** Map images of random events *B*, *R*, *Y*.

The random events *B*, *R*, *Y* are distinguished from their corresponding random indicator variables $\mathsf{B}$, $\mathsf{R}$, $\mathsf{Y}$, defined as usual, e.g.,

$$\mathsf{B}(\omega) = \mathbbm{1}_{B}(\omega), \quad \omega \in \Omega,$$

where $\mathbbm{1}_{B}$ denotes the indicator of the event *B*. They are assigned to pixels of a 7×7 digital map image, Fig. 3.1.

It should be noted that in this example any spatial references are solely owed to the purpose of visualization as map images, and that the test itself does not take any spatial references or spatially induced dependences into account.

Checking independence according to its definition in reference to random events, the figures

$$P(B \cap R) = 0.122, \quad P(B) \, P(R) = 0.119$$

indicate that the random events *B* and *R* are not independent. However, the deviation is small.

Next, conditional independence is checked in terms of its definition referring to random events. Since conditional independence of the random events *B* and *R* given *Y* does not imply conditional independence of the random events *B* and *R* given the complement ∁*Y*, two checks are required. The results are

$$\begin{aligned} P(B \cap R \mid Y) &= \frac{1}{6} = P(B \mid Y) \; P(R \mid Y) \\ P(B \cap R \mid \complement Y) &= \frac{4}{37} \neq \left(\frac{12}{37}\right)^2 = P(B \mid \complement Y) \; P(R \mid \complement Y), \end{aligned}$$

and indicate that the random events *B* and *R* are conditionally independent given the random event *Y*, but that they are not conditionally independent given the complement ∁*Y*. It should be noted that the deviation between the joint conditional probability and the product of the two individual conditional probabilities, expressed as their ratio, is 1.027. In fact, the events *B* and *R* are conditionally independent given either *Y* or ∁*Y* if one white pixel, e.g. pixel (1,7) with $\mathsf{B} = \mathsf{R} = \mathsf{Y} = 0$, is omitted.
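These event-level checks can be reproduced with exact arithmetic. The pixel map itself is not listed in the text, so the cell counts in the sketch below are an inference: they are the counts implied by the probabilities quoted above (18 blue, 16 red and 12 yellow pixels, $P(B \cap R) = 6/49$, and the conditional probabilities given *Y* and ∁*Y*). The names `counts` and `check` are introduced here for illustration only.

```python
from fractions import Fraction

# counts[(y, b, r)] = number of pixels with indicator values Y=y, B=b, R=r.
# Inferred from the probabilities quoted in the text (assumption of this sketch).
counts = {
    (1, 1, 1): 2, (1, 1, 0): 4, (1, 0, 1): 2, (1, 0, 0): 4,
    (0, 1, 1): 4, (0, 1, 0): 8, (0, 0, 1): 8, (0, 0, 0): 17,
}


def check(cells, y):
    """Return P(B and R | Y=y) and P(B | Y=y) * P(R | Y=y) as exact fractions."""
    n = sum(c for (yy, b, r), c in cells.items() if yy == y)
    n_b = sum(c for (yy, b, r), c in cells.items() if yy == y and b == 1)
    n_r = sum(c for (yy, b, r), c in cells.items() if yy == y and r == 1)
    return Fraction(cells[(y, 1, 1)], n), Fraction(n_b, n) * Fraction(n_r, n)


joint_y, prod_y = check(counts, 1)   # given Y
joint_c, prod_c = check(counts, 0)   # given the complement of Y
ratio = joint_c / prod_c             # deviation expressed as a ratio

# Omit one white pixel (B = R = Y = 0) and repeat the check given the complement.
reduced = dict(counts)
reduced[(0, 0, 0)] -= 1
joint_w, prod_w = check(reduced, 0)

print(joint_y, prod_y)
print(joint_c, prod_c, float(ratio))
print(joint_w, prod_w)
```

Working with `Fraction` avoids rounding, so equality and inequality of the two sides can be tested exactly.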

Generalizing the view to the random variables $\mathsf{B}$, $\mathsf{R}$, $\mathsf{Y}$ and their unique joint realization as shown in Fig. 3.1, Pearson's $\chi^2$ test with Yates' continuity correction of the null-hypothesis of independence of the random variables $\mathsf{B}$ and $\mathsf{R}$ given the data returns a *p*-value of 1, indicating that the null-hypothesis cannot reasonably be rejected.

The likelihood ratio test is applied with respect to the log-linear distribution corresponding to the null-hypothesis of conditional independence and results in a *p*-value of 0*.*996 indicating that the null-hypothesis cannot reasonably be rejected.

Thus, given the data, the tests suggest that the random variables $\mathsf{B}$ and $\mathsf{R}$ are independent and conditionally independent given the random variable $\mathsf{Y}$.

#### **3.8.1.2 The Data Set** SCCI

The next data set SCCI comprises three random events $B_1$, $B_2$, $T$ with given probabilities $P(B_1) = P(B_2) = P(T) = \frac{7}{49} = \frac{1}{7} = 0.142$. They are assigned to pixels of a 7×7 digital map image, Fig. 3.2.

Checking independence according to its definition for random events, the figures

$$P(B_1 \cap B_2) = 0.102, \quad P(B_1)P(B_2) = 0.0204$$

indicate that the random events $B_1$ and $B_2$ are not independent.

Next, conditional independence is checked in terms of its definition referring to random events. Since conditional independence of the random events $B_1$ and $B_2$ given $T$ does not imply conditional independence of the random events $B_1$ and $B_2$ given ∁*T*, two checks are required. The results are

$$\begin{aligned} P(B_1 \cap B_2 \mid T) &= 0.714 \neq 0.734 = P(B_1 \mid T) \; P(B_2 \mid T) \\ P(B_1 \cap B_2 \mid \complement T) &= 0 \neq 0.0005 = P(B_1 \mid \complement T) \; P(B_2 \mid \complement T), \end{aligned}$$

and indicate that the random events $B_1$ and $B_2$ are neither conditionally independent given the random event $T$ nor given the complement ∁*T*.

**Fig. 3.2** Map images of random events $B_1$, $B_2$, $T$ with $P(B_1) = P(B_2) = P(T) = \frac{7}{49} = \frac{1}{7} = 0.142$.

Testing the null-hypothesis of independence of the random variables $\mathsf{B}_1$ and $\mathsf{B}_2$ with Pearson's $\chi^2$ test with Yates' continuity correction given the data returns a *p*-value practically equal to 0, indicating that the null-hypothesis should be rejected. The likelihood ratio test is applied with respect to the log-linear distribution corresponding to the null-hypothesis of conditional independence and results in a *p*-value of 0.825, indicating that the null-hypothesis cannot reasonably be rejected.

Thus, given the data, the tests imply that the random variables $\mathsf{B}_1$ and $\mathsf{B}_2$ are not independent but conditionally independent given the random variable $\mathsf{T}$.
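The *p*-values reported for the two data sets of this section can be recomputed from 2×2×2 cell counts. The sketch below rests on two assumptions that are not stated in the text: the cell counts are those implied by the quoted event probabilities (the pixel maps are not reproduced here), and Yates' correction is capped so that it never exceeds the absolute deviation between observed and expected counts. The function names are hypothetical, and only the standard library is used; for 1 and 2 degrees of freedom the $\chi^2$ tail probabilities have simple closed forms.

```python
import math

# Cell counts n[(z0, z1, z2)], inferred from the probabilities quoted in the text.
# BRY:  z0 = Y, z1 = B,  z2 = R.    SCCI: z0 = T, z1 = B1, z2 = B2.
BRY = {(1, 1, 1): 2, (1, 1, 0): 4, (1, 0, 1): 2, (1, 0, 0): 4,
       (0, 1, 1): 4, (0, 1, 0): 8, (0, 0, 1): 8, (0, 0, 0): 17}
SCCI = {(1, 1, 1): 5, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
        (0, 1, 1): 0, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 40}


def lr_test_conditional_independence(n):
    """Likelihood ratio test of H0: Z1 and Z2 conditionally independent given Z0."""
    g2 = 0.0
    for z0 in (0, 1):
        n0 = sum(n[(z0, a, b)] for a in (0, 1) for b in (0, 1))
        for a in (0, 1):
            for b in (0, 1):
                obs = n[(z0, a, b)]
                n1 = n[(z0, a, 0)] + n[(z0, a, 1)]   # Z1 = a within the stratum
                n2 = n[(z0, 0, b)] + n[(z0, 1, b)]   # Z2 = b within the stratum
                if obs > 0:                          # 0 * log 0 is taken as 0
                    g2 += 2.0 * obs * math.log(obs * n0 / (n1 * n2))
    # Three dichotomous variables give 2 degrees of freedom, for which the
    # chi-square survival function is exactly exp(-x / 2).
    return g2, math.exp(-g2 / 2.0)


def yates_test_independence(n):
    """Pearson chi-square test with Yates' correction of H0: Z1, Z2 independent."""
    table = [[sum(n[(z0, a, b)] for z0 in (0, 1)) for b in (0, 1)] for a in (0, 1)]
    total = sum(map(sum, table))
    stat = 0.0
    for a in (0, 1):
        for b in (0, 1):
            expected = sum(table[a]) * (table[0][b] + table[1][b]) / total
            dev = max(0.0, abs(table[a][b] - expected) - 0.5)  # capped correction
            stat += dev * dev / expected
    # One degree of freedom: survival function via the complementary error function.
    return stat, math.erfc(math.sqrt(stat / 2.0))


for name, data in (("BRY", BRY), ("SCCI", SCCI)):
    _, p_marginal = yates_test_independence(data)
    _, p_conditional = lr_test_conditional_independence(data)
    print(name, p_marginal, p_conditional)
```

Under these assumptions the computed *p*-values agree with those reported in the text to within rounding.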

### **3.9 Discussion and Conclusions**

Since pairwise conditional independence does not imply joint conditional independence, the $\chi^2$-test (Bonham-Carter 1994) of independence given $\mathbb{Z}_0 = 1$ does not apply to checking the modeling assumption of weights-of-evidence. The disadvantage of both the "omnibus" test (Bonham-Carter 1994) and the "new omnibus" test (Agterberg and Cheng 2002) is twofold. First, it involves an assumption of normal distribution which itself should be subject to a test. Second, weights-of-evidence has to be applied to calculate the test statistic, which is the sum of all predicted conditional probabilities within the training data set. If the test actually suggests rejection of the null-hypothesis of conditional independence, the user learns that the application of weights-of-evidence to predict the conditional probabilities was not mathematically justified. The standard likelihood ratio test suggested here resolves both shortcomings.

**Acknowledgements** The author would like to thank Prof. Juanjo Egozcue, UPC Barcelona, Spain, and Prof. K. Gerald van den Boogaart, HIF, Germany, for emphatic and instructive discussions of conditional independence.

### **References**



### **Chapter 4 Modelling Compositional Data. The Sample Space Approach**

**Juan José Egozcue and Vera Pawlowsky-Glahn**

J. J. Egozcue (✉)

Department of Civil and Environmental Engineering, Universidad Politécnica de Cataluña, Barcelona, Spain e-mail: juan.jose.egozcue@upc.edu

V. Pawlowsky-Glahn Department of Computer Science, Applied Mathematics and Statistics, University of Girona, Girona, Spain e-mail: vera.pawlowsky@udg.edu

**Abstract** Compositions describe parts of a whole and carry relative information. Compositional data appear in all fields of science, and their analysis requires paying attention to the appropriate sample space. The log-ratio approach proposes the simplex, endowed with the Aitchison geometry, as an appropriate representation of the sample space. The main characteristics of the Aitchison geometry are presented, which open the door to statistical analysis aimed at extracting the relative, not absolute, information. As a consequence, compositions can be represented in Cartesian coordinates by using an isometric log-ratio transformation. Standard statistical techniques can be used with these coordinates.

**Keywords** Compositional data analysis ⋅ Aitchison geometry ⋅ Simplex ⋅ Variation matrix ⋅ Biplot ⋅ Balance dendrogram ⋅ ilr ⋅ clr

**AMS Subject classifications** 62-07 ⋅ 62-02

### **4.1 Introduction**

The difficulties when dealing with compositional data have been known for more than a century. Indirectly, Pearson (1897) described some of these problems and coined the term spurious correlation. They are easily illustrated using the early characterizations of compositional data, which rely on the constant sum constraint (CSC). For instance, Chayes (1960, 1962) and Connor and Mosimann (1969) based their analysis on the fact that a vector of proportions $\mathbf{x} = (x_1, x_2, \ldots, x_D)$ satisfies the CSC,

$$\sum_{i=1}^{D} x_i = \kappa > 0, \quad x_i > 0, \quad i = 1, 2, \dots, D. \tag{4.1}$$

It defines the $\kappa$-simplex of *D* components or parts. Here the simplex is denoted $\mathcal{S}^D$, with no reference to the positive constant $\kappa$. Data fulfilling the CSC were called *constrained or closed data*. In the eighties, promoted by J. Aitchison, this kind of data came to be recognized as *compositional data* (Aitchison and Shen 1980; Aitchison 1982, 1986). In the last reference, additional conditions were added to the original CSC characterisation, leading to the formulation of some principles for compositional data analysis. They were the starting point on which the log-ratio approach to compositional data is based. These principles have been reformulated several times in order to refine and clarify them for users (Aitchison and Egozcue 2005; Egozcue 2009; Egozcue and Pawlowsky-Glahn 2011a; Pawlowsky-Glahn et al. 2015). Nonetheless, they have been contested from different points of view (e.g. Scealy and Welsh 2014), arguing that they match the conditions for the application of log-ratio methods. But not all data satisfying the CSC (4.1), for instance when some parts are admitted to be zero, are automatically adequate for a log-ratio analysis. In the last decade, in which the log-ratio approach has proven useful in a large number of applications, it also became clear that it can be rigorously applied to problems in which the CSC is not fulfilled, or where the components do not represent proportions. The key point for this change of the paradigm represented by the CSC is the conception of compositions as equivalence classes of vectors whose positive components are proportional (Barceló-Vidal et al. 2001; Martín-Fernández et al. 2003; Pawlowsky-Glahn et al. 2015; Barceló-Vidal and Martín-Fernández 2016), and the related idea that the simplex is just a representation of the sample space of compositions. This fact is a direct consequence of the scale invariance of compositions (Aitchison 1986) but, up to now, its implications have not been completely recognised.

This contribution aims at a reformulation of the principles of compositional data analysis in their log-ratio version, presenting them as a practical and natural need in many situations of data analysis. Section 4.2 discusses scale invariance and compositional equivalence, and Sect. 4.3 presents the simplex as an appropriate sample space for compositional data. Perturbation, the group operation between compositions, is shown to be a natural operation in Sect. 4.4. The Aitchison distance and the requirements on it are discussed in Sect. 4.5. The consequence of the previous sections is the Euclidean space structure of the simplex, which has been termed Aitchison geometry (Pawlowsky-Glahn and Egozcue 2001). The Aitchison geometry has been shown to be useful for the modelling and analysis of compositions, focusing the interest on the relative information contained in the data. Some of these elements are commented on in Sect. 4.6.

### **4.2 Scale Invariance, Key Principle of Compositions**

When somebody records the composition of a product, a material, the shares of a market, the species in an ecosystem or a kitchen recipe, he or she implicitly recognizes that the total amount is irrelevant for the description of the product, material, shares, species or recipe. This does not mean that the size or the amount is not informative; it only tells us that, whatever the size, the elements of the total are distributed according to the specified composition. What is essential are the ratios between the components of the described system. One can say that for any system that can be decomposed into parts, its description has at least two types of information: one that is referred to as *size*, and another one that concerns the relations between the parts irrespective of the size. This latter one is called compositional information and, when the system is a geometric object, it is called *shape*. Beyond size (total amount) and composition (shape), there may be other properties of the system which can be quantified (color, sound, complexity, strength, ...), and again these additional properties may be decomposed into size and composition. Here, attention is paid to systems which are formed by parts, while their size or total amount is either analysed in another way or is irrelevant. For a discussion of a possible approach to a problem where interest lies in the relative information and in the total, see Pawlowsky-Glahn et al. (2015), Olea et al. (2016), Ferrer-Rossell et al. (2016).

Think about the map of a region; even changing the scale of the map, the same region is identified. If the distance between two mountain peaks was 12 cm, and a lake between the two was 4 cm broad, halving the scale new lengths of 6 and 2 cm will be obtained. The distance between the two peaks and the width of the lake can be identified as equal in the two maps, as the ratio is in both cases 12∕4 = 6∕2 = 3. Only when the maps are to be transformed into an actual region, the size becomes relevant and it is revealed taking into account the scale of the maps. Note that in the case of the peaks and the lake, the considered parts, the distance between peaks and the width of the lake, are not disjoint, as the first includes the second. In fact, the previous comments did not imply that the parts of the system had to be nonoverlapping or disjoint.

The irrelevance of the total led J. Aitchison (1986) to introduce the principle of scale invariance for compositions. A composition is assumed to be represented by an array of positive numbers which quantitatively represent the parts of the system. Let $\mathbf{x} = (x_1, x_2, \ldots, x_D)$, $x_i > 0$ for $i = 1, 2, \ldots, D$, be such a composition. Consider any positive constant $c > 0$. The scale invariance principle can be stated as: $\mathbf{x}$ and $c\,\mathbf{x}$ contain the same compositional information. From this point of view, *compositional equivalence* can be defined (Aitchison 1997; Barceló-Vidal et al. 2001; Barceló-Vidal and Martín-Fernández 2016; Pawlowsky-Glahn et al. 2015).

**Definition 4.2.1** (*Compositional equivalence*) Let $\mathbf{x} = (x_1, x_2, \ldots, x_D)$ and $\mathbf{y} = (y_1, y_2, \ldots, y_D)$ be two arrays of *D* positive components. They are compositionally equivalent if there exists a positive constant $c$ such that, for $i = 1, 2, \ldots, D$, $y_i = c\,x_i$.

Two equivalent arrays $\mathbf{x}$, $\mathbf{y}$ represent the same composition. Both the equivalence class generated and its representative are called compositions.

Figure 4.1 shows some artificial, arbitrary data of Ca and Mg in mg/l from a fictitious water analysis (circles). Each pair (Ca,Mg) can be considered as a two part composition. A line from the origin through each data point consists of compositionally equivalent points, thus visualising a composition, strictly speaking an equivalence class. Any point on these rays can be chosen as a representative of the composition. Particularly, they can be selected so that the sums of the two components add to 100, which correspond to the triangles on the 2-part simplex (full line). This means that compositions are equivalence classes of compositionally equivalent arrays. Equivalence classes are handled by selecting a representative of each class and operating with these representatives. The selection of representative of a class is arbitrary, but imposes a condition on any further analysis. This condition is the principle of scale invariance formulated in Aitchison (1986).

**Principle 4.2.1** (Scale invariant analysis) *Any analysis or operation with compositions must be expressed by scale invariant functions of the components. Scale invariant functions are identified with real,* 0*-degree homogeneous functions, that is, satisfying the condition* $f(\mathbf{x}) = f(c\,\mathbf{x})$ *for any positive constant* $c$ *and for any composition* $\mathbf{x}$*.*

Consequently, for any composition given by the array $\mathbf{x}$ it is possible to choose another compositionally equivalent array, denoted $\mathcal{C}\mathbf{x}$, such that it is in the simplex, that is, it fulfills the CSC (4.1). To this end, the constant $\kappa = 1$ is chosen in the CSC (4.1), thus yielding

$$\mathcal{C}\mathbf{x} = \left(\frac{x_1}{\sum_{i=1}^D x_i}, \frac{x_2}{\sum_{i=1}^D x_i}, \dots, \frac{x_D}{\sum_{i=1}^D x_i}\right).$$

**Fig. 4.1** Some two-component data points with positive components (circles), are compositionally equivalent to all points on the dashed lines from the origin through the data points. Triangles are the representatives of each equivalence class on the 2-part simplex in which components add to 100

The symbol $\mathcal{C}$ is called the closure operator. It assigns a representative in the simplex (the closed form of $\mathbf{x}$, satisfying the CSC) to the equivalence class in which $\mathbf{x}$ is included. Due to the principle of scale invariant analysis, any analysis on the (closed) elements in the simplex must lead to results identical to those obtained using the non-closed representatives.
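As a small illustration of closure and of a scale invariant function, the following sketch closes two compositionally equivalent arrays and evaluates a log-ratio on both. The data values and the function names `closure` and `log_ratio` are hypothetical and serve only this example.

```python
import math


def closure(x, kappa=1.0):
    """Rescale an array of positive parts so that they sum to kappa."""
    total = sum(x)
    return [kappa * xi / total for xi in x]


def log_ratio(x, i, j):
    """Log-ratio of two parts: a 0-degree homogeneous function of x."""
    return math.log(x[i] / x[j])


x = [12.0, 4.0, 8.0]              # hypothetical parts in arbitrary units
y = [1000.0 * xi for xi in x]     # the same composition in other units

closed_x = closure(x)
closed_y = closure(y)

print(closed_x, closed_y)                        # identical representatives
print(log_ratio(x, 0, 1), log_ratio(closed_y, 0, 1))
```

Both arrays are mapped to the same representative in the simplex, and the log-ratio takes the same value whether it is evaluated on the raw or on the closed array.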

The scale invariance principle is familiar to any scientist. For instance, an array of probabilities such as (0.1, 0.3, 0.2), originally expressed as values between 0 and 1, can be expressed in percentages as (10, 30, 20) without any confusion; a set of concentrations given in percentages of mass can be translated into ppm (parts per million of mass) just by multiplying by 10,000, and the geologist does not get confused provided that he/she is informed about which units are in use.

Despite the intuitive character of the scale invariance principle, in practice it is frequently violated, for instance when performing a cluster analysis of geochemical samples given in ppm using the Euclidean distance between the samples. In fact, assume that we have two samples $\mathbf{x}$ and $\mathbf{y}$, and the square distance between them is taken as the square-Euclidean distance $\mathrm{d}^2(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{D} (x_i - y_i)^2$. Imagine that $\mathbf{y}$ is now expressed in ppb (parts per billion). This is a valid operation, as $\mathbf{y}$ in ppm and $\mathbf{y}$ in ppb are compositionally equivalent, but $\mathrm{d}^2(\mathbf{x}, \mathbf{y})$ changes dramatically, as the square-differences $(x_i - y_i)^2$ become $(x_i - 1000 \cdot y_i)^2$, which constitutes a violation of the scale invariance principle.
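The effect is easy to reproduce. In the sketch below the two samples are hypothetical values in ppm, and `squared_euclidean` is a name chosen for this example; the second sample is then re-expressed in ppb, which leaves its composition unchanged.

```python
def squared_euclidean(x, y):
    """Square-Euclidean distance between two arrays of equal length."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))


x_ppm = [2.0, 3.0, 5.0]                  # hypothetical sample, in ppm
y_ppm = [1.0, 4.0, 5.0]                  # hypothetical sample, in ppm
y_ppb = [1000.0 * yi for yi in y_ppm]    # same composition, in ppb

d2_same_units = squared_euclidean(x_ppm, y_ppm)
d2_mixed_units = squared_euclidean(x_ppm, y_ppb)

print(d2_same_units, d2_mixed_units)
```

The distance jumps by several orders of magnitude although neither composition has changed, so the square-Euclidean distance on raw components is not a scale invariant function.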

Similarly, given a set of geochemical samples in ppm, $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, the Pearson correlation coefficient between two components also violates the principle of scale invariance. This coefficient between $x_{\cdot 1}$ and $x_{\cdot 2}$ is

$$r_{12} = \frac{\sum_{j=1}^{n} (x_{j1} - \bar{x}_1)(x_{j2} - \bar{x}_2)}{\sqrt{\sum_{j=1}^{n} (x_{j1} - \bar{x}_1)^2 \sum_{j=1}^{n} (x_{j2} - \bar{x}_2)^2}},\tag{4.2}$$

where $\bar{x}_k$ is the average of the *k*-th component along the sample. Now suppose that the first sample $\mathbf{x}_1$ is expressed in ppb. According to the scale invariance principle, this should not change the analysis. However, everything changes: the average values $\bar{x}_k = (1/n) \sum_{j=1}^{n} x_{jk}$ are now dominated by the first term $1000 \cdot x_{1k}$, which replaces the initial term $x_{1k}$. The global effect is evident after a simple inspection of Eq. (4.2). When the change of closure affects all the samples, the effect is the *spurious correlation* studied by Chayes (1960), although without any successful solution. Nowadays, after J. Aitchison's work, spurious correlation just corresponds to a violation of the scale invariance principle. In other words, if a data set is assumed scale invariant, covariance or Pearson correlation are meaningless and spurious, and should not be used.
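A numerical sketch of this argument, with hypothetical data: four three-part samples in ppm, of which only the first is re-expressed in ppb. The helper names `pearson` and `r12` are introduced here and implement Eq. (4.2) directly.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences, Eq. (4.2)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def r12(samples):
    """Correlation between the first and the second component."""
    return pearson([s[0] for s in samples], [s[1] for s in samples])


# Hypothetical samples in ppm (rows are samples, columns are components).
samples_ppm = [[1.0, 4.0, 5.0], [2.0, 3.0, 5.0], [3.0, 2.0, 5.0], [4.0, 1.0, 5.0]]

# Express only the first sample in ppb; its composition is unchanged.
rescaled = [[1000.0 * v for v in samples_ppm[0]]] + samples_ppm[1:]

r_before = r12(samples_ppm)
r_after = r12(rescaled)
print(r_before, r_after)
```

With these values the coefficient flips from a perfect negative correlation to an almost perfect positive one, although every sample still represents the same composition as before.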

### **4.3 The Simplex as Sample Space of Compositions**

In any data analysis, the first modeling step is to establish an appropriate sample space. In general, this step conditions all subsequent steps, and may affect dramatically the conclusions. Dealing with compositional data is not an exception. However, the choice and structure of the sample space is usually not explicit, and its consequences remain hidden in practice. Even the analyst is frequently not aware of the choice he or she has made when taking a decision on which methodology to apply.

The sample space of an observation (variable, vector, function or, in general, object) is a set where all the possible outcomes can be represented. However, the sample space may contain elements which do not correspond to any possible observation. When the considered object is a random one, the sample space must contain subsets, called events, which can be assigned a probability. Technically, a $\sigma$-field of subsets of the sample space (e.g. Ash 1972; Feller 1968) needs to be defined. This is the minimum structure of a sample space for a random object. There are many qualitatively different random objects in practice. Multivariate real random *d*-vectors may be thought of as taking values in real space $\mathbb{R}^d$; a discrete-time, real-valued stochastic process can be represented in the space $\ell^\infty$ of all real, bilaterally bounded sequences; if the observation is a random set on a plane, like paint stains on the floor, the sample space can be the set of compact sets in the plane; there are many more examples. It should be noted that the sample space is a choice of the analyst and it must be selected according to the stated questions from the beginning of the analysis. Commonly, beyond probability statements, the data analysis requires performing operations (sums, differences, averages, scaling), metric computations (distances or divergences, projections, approximations), or computing functionals (averages of components, extraction of extremes). All these procedures must be defined on the sample space. Consequently, the structure of the sample space is richer than that provided by the $\sigma$-field of events.

When dealing with *D*-part compositional data, the simplex $\mathcal{S}^D$ as the sample space is a valid choice, given that any composition can be assigned a representative in it. However, there are many alternatives. Figure 4.1 suggests that any curve intersecting once, and only once, all rays from the origin in the positive orthant might be taken as sample space. For instance, for two-dimensional data points like those shown in Fig. 4.1, a possible choice is a quarter of a circumference, or two segments completing a square with the axes, as shown in Fig. 4.2. In the case of compositional data, the analyst is mainly interested in proportions and ratios, thus suggesting the choice of the simplex as an appropriate and intuitive representation. However, a key point for the choice of an adequate sample space is the decision about which translation or shift is relevant for the analysis.

### **4.4 Perturbation, a Natural Shift Operation on Compositions**

Perturbation, as an operation in the simplex, was introduced by Aitchison (1986) on an intuitive basis. It can be stated as follows.

**Definition 4.4.1** (*perturbation*) Let $\mathbf{x}$, $\mathbf{y}$ be two elements in the *D*-part simplex $\mathcal{S}^D$, $\mathbf{x} = (x_1, x_2, \ldots, x_D)$, $\mathbf{y} = (y_1, y_2, \ldots, y_D)$. The perturbation between them is

$$\mathbf{x} \oplus \mathbf{y} = \mathcal{C}(x_1 y_1, x_2 y_2, \dots, x_D y_D) \,. \tag{4.3}$$

Some properties of perturbation are quite immediate. They can be summarized by saying that perturbation is a commutative group operation in $\mathcal{S}^D$ (Aitchison 1997). The neutral element is the composition with equal components, $\mathcal{C}(1, 1, \ldots, 1)$. The opposite to $\mathbf{x}$ is

$$\ominus\mathbf{x} = \mathcal{C}\left(\frac{1}{x_1}, \frac{1}{x_2}, \dots, \frac{1}{x_D}\right),$$

where each component is inverted.

Repeated perturbation, like $\mathbf{x} \oplus \mathbf{x} \oplus \mathbf{x}$, suggests the definition of a multiplication by a real scalar, so that $\mathbf{x} \oplus \mathbf{x} \oplus \mathbf{x} = 3 \odot \mathbf{x}$. Following this idea, multiplication by real scalars, called powering, is defined as follows.

**Definition 4.4.2** (*powering*) Let $\mathbf{x} = (x_1, x_2, \ldots, x_D)$ be an element in the *D*-part simplex $\mathcal{S}^D$ and let $a$ be a real scalar. The powering of $\mathbf{x}$ by $a$ is

$$a \odot \mathbf{x} = \mathcal{C}(x_1^a, x_2^a, \dots, x_D^a) \,. \tag{4.4}$$

These definitions present perturbation and powering as operations on elements of the simplex. However, as the simplex can be taken as the sample space of compositions and its elements are representatives of compositions, perturbation and powering are also operations on compositions. The simplex, endowed with perturbation and powering, is a (*D* − 1)-dimensional vector space. Perturbation plays the role of the sum in real space, and powering is multiplication by a real scalar. Perturbing a composition $\mathbf{x}$ by another composition $\mathbf{y}$ is thus a shift of $\mathbf{x}$ in the direction of $\mathbf{y}$.
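The two operations and their basic properties can be sketched in a few lines. The function names (`closure`, `perturb`, `power`, `opposite`) and the three-part example values are assumptions made for this illustration.

```python
def closure(x):
    """Closure to constant sum 1."""
    total = sum(x)
    return [xi / total for xi in x]


def perturb(x, y):
    """Perturbation, Eq. (4.3): closed component-wise product."""
    return closure([xi * yi for xi, yi in zip(x, y)])


def power(a, x):
    """Powering, Eq. (4.4): closed component-wise power."""
    return closure([xi ** a for xi in x])


def opposite(x):
    """Opposite element with respect to perturbation."""
    return closure([1.0 / xi for xi in x])


x = closure([1.0, 2.0, 7.0])      # hypothetical compositions
y = closure([3.0, 1.0, 1.0])
neutral = closure([1.0, 1.0, 1.0])

print(perturb(x, neutral))               # x is recovered
print(perturb(x, opposite(x)))           # the neutral element
print(perturb(perturb(x, x), x))         # repeated perturbation
print(power(3.0, x))                     # the same as powering by 3
print(perturb([10.0, 20.0, 70.0], y))    # scale of the representative is irrelevant
print(perturb(x, y))
```

Because every operation ends with a closure, the result does not depend on which representative of a composition is passed in.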

**Table 4.1** Inflow concentrations of some ions dissolved in water are filtered, reducing Fe, SO4 and P by a given percentage. Outflow concentrations are obtained by multiplying the inflow concentrations by the filter factor (closed or not). Inflow and outflow concentrations and the filter factor are also presented in closed form, as they are treated as compositions

Despite the mathematical aspect of Definition 4.4.1, perturbation is commonplace in real life and scientific activity. To begin with, imagine a water filtering device which is fed with an inflow of dissolved matter characterised by the concentrations (mg/l) of the major ions specified in Table 4.1, first row. Suppose that the filtering device has been designed to filter out sulphur, SO4, iron, Fe, and phosphorus, P; SO4 is ideally reduced by 75%, Fe by 10%, and P by 5%, while the other ions remain unaltered. In order to compute the outflow concentrations, the filter factor or transfer function (4th row) is computed as 1 − (10∕100) = 0.9 in the case of Fe. Then, the filter factor multiplies the inflow concentrations to obtain the outflow concentrations in mg/l. Notably, when the inflow concentrations are represented in closed form, as percentages (second row), and are then multiplied by the filter factor, the same outflow concentrations in percent are obtained. In fact, the outflow concentrations in mg/l, when closed to 100, are those in the last row of the table. The closed form of the filter factor, labelled filter perturbation, can be used to obtain the same outflow concentrations. That is, the filter factor is a composition. Although elementary, this example shows that inflow and outflow concentrations and the filter factor can be represented by different, but compositionally equivalent, arrays; and that the traditional way of expressing a change of concentrations by percentages is nothing else than a way of expressing a perturbation. Also, one may be confronted with the estimation of the filter factor (perturbation) from the inflow and outflow concentrations. From the example, it is clear that a ratio of outflow over inflow concentrations gives a factor compositionally equivalent to the filter perturbation. This suggests the definition of the difference-perturbation, the opposite operation to perturbation, as

$$\mathbf{y} \ominus \mathbf{x} = \mathcal{C}\left(\frac{y\_1}{x\_1}, \frac{y\_2}{x\_2}, \dots, \frac{y\_D}{x\_D}\right),$$

which is the natural difference for perturbation as a group operation.
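The filter example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inflow values below are hypothetical (they are not the values of Table 4.1), and the helper names `closure`, `perturb` and `perturb_diff` are introduced here for convenience.

```python
def closure(x, total=1.0):
    """Rescale positive parts so that they add up to `total`."""
    s = sum(x)
    return [total * v / s for v in x]

def perturb(x, y):
    """Perturbation: closed componentwise product."""
    return closure([a * b for a, b in zip(x, y)])

def perturb_diff(y, x):
    """Difference-perturbation: closed componentwise ratio y / x."""
    return closure([a / b for a, b in zip(y, x)])

# Hypothetical inflow concentrations in mg/l, ordered as (SO4, Fe, P, other).
inflow = [40.0, 2.0, 1.0, 57.0]
# Filter factor for reductions of 75%, 10%, 5% and 0%.
filter_factor = [0.25, 0.90, 0.95, 1.00]

outflow_mgl = [c * f for c, f in zip(inflow, filter_factor)]

# Closing before or after filtering yields the same composition.
out_from_mgl = closure(outflow_mgl)
out_from_closed = perturb(closure(inflow), closure(filter_factor))

# The filter perturbation is recovered from outflow and inflow.
recovered = perturb_diff(outflow_mgl, inflow)
```

Here `recovered` coincides with `closure(filter_factor)`, which is the sense in which the ratio of outflow over inflow is compositionally equivalent to the filter perturbation.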

In the context of probability theory, arrays of probabilities can be considered as compositions. Consider a family of non-overlapping events *A*<sub>*i*</sub>, *i* = 1, 2, …, *D*, which are assigned probabilities *p*<sub>*i*</sub> = P[*A*<sub>*i*</sub>]. Observing the result *R* of an experiment, the conditional probabilities *q*<sub>*i*</sub> = P[*R*|*A*<sub>*i*</sub>] allow one to update the probabilities *p*<sub>*i*</sub>, according to the information obtained from the observation *R*, using Bayes' formula

$$\mathbb{P}[A\_i|R] = \frac{\mathbb{P}[A\_i] \cdot \mathbb{P}[R|A\_i]}{\sum\_{j=1}^{D} \mathbb{P}[A\_j] \cdot \mathbb{P}[R|A\_j]} = \mathcal{C} \left(\mathbf{p} \oplus \mathbf{q}\right) \,,$$

where **p** = (*p*<sub>1</sub>, *p*<sub>2</sub>, …, *p*<sub>*D*</sub>) and **q** = (*q*<sub>1</sub>, *q*<sub>2</sub>, …, *q*<sub>*D*</sub>). Bayes' formula states that the final probabilities, conditioned to the result *R*, are the perturbation of the initial or prior probabilities by the probabilities of the result given the events *A*<sub>*i*</sub>, denoted *q*<sub>*i*</sub>, also known as the likelihood of *R*. In this way perturbation becomes a very natural way of operating on vectors of probabilities and likelihoods, as it is the paradigm of incorporating information from observations. This interpretation of perturbation was proposed in Aitchison (1986, 1997) and developed in other contexts (Egozcue and Pawlowsky-Glahn 2011b; Egozcue et al. 2013).
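As a minimal numerical sketch of this reading of Bayes' formula, the following Python fragment compares the term-by-term formula with a perturbation of the prior by the likelihood. The prior and likelihood values are hypothetical, chosen only for illustration.

```python
def closure(x):
    """Rescale positive parts so that they add up to one."""
    s = sum(x)
    return [v / s for v in x]

def perturb(x, y):
    """Perturbation: closed componentwise product."""
    return closure([a * b for a, b in zip(x, y)])

prior = [0.5, 0.3, 0.2]        # hypothetical p_i = P[A_i]
likelihood = [0.1, 0.4, 0.8]   # hypothetical q_i = P[R | A_i]

# Bayes' formula written out term by term.
evidence = sum(p * q for p, q in zip(prior, likelihood))
posterior_bayes = [p * q / evidence for p, q in zip(prior, likelihood)]

# The same update as a perturbation of the prior by the likelihood.
posterior_pert = perturb(prior, likelihood)

# Only the ratios of the likelihood matter: rescaling it changes nothing.
posterior_scaled = perturb(prior, [10.0 * q for q in likelihood])
```

The last line illustrates why the likelihood can be treated as a composition: any positive multiple of it leads to the same posterior.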

Perturbation also appears as a natural operation on compositions when changing units. For instance, consider a grain size distribution for different sieve diameters. It may be expressed as proportions of volume corresponding to each sieve or as proportions of mass assigned to the same sieves. Both distributions can be considered as compositions. Transforming volume to mass consists of multiplication by the density of the material in each sieve, possibly different from one sieve to the other. This componentwise multiplication is a perturbation (Parent et al. 2012). Also, changing the concentrations of chemical elements from mg/kg to molar concentration consists of dividing each component by its molar mass, thus performing a perturbation. In all these examples, the secondary role of the closure and the CSC is remarkable: closure might only be necessary to facilitate interpretation.

Exponential decay of mass is frequent in nature. The typical example is the decay of mass of radioactive isotopes in time. These types of processes describe straight lines in the simplex (Egozcue et al. 2003; Pawlowsky-Glahn et al. 2015; Tolosana-Delgado 2012). This supports the view that perturbation is a natural operation in the simplex and between compositions. To sketch the argument, consider the masses of *D* = 3 fictitious radioactive isotopes **x**(*t*) = (*x*<sub>1</sub>(*t*), *x*<sub>2</sub>(*t*), *x*<sub>3</sub>(*t*)), whose decay rates in time are *λ*<sub>1</sub> = 3, *λ*<sub>2</sub> = 0.5, *λ*<sub>3</sub> = 0.1, respectively. Initially, at *t* = 0, there are masses **x**(0) = (0.9, 0.04, 0.01) which disintegrate into other isotopes that are not considered. The total mass decreases in time, and the mass of each isotope changes as

$$x\_i(t) = x\_i(0) \cdot \exp[-\lambda\_i t]\,, \quad i = 1, 2, 3\,. \tag{4.5}$$

**Fig. 4.3** Evolution of masses (left panel) and proportions (right panel) of three isotopes which disintegrate at rates 3, 0.5, 0.1 in time, respectively. Initial masses are 0.9, 0.04, 0.01

This evolution of mass is shown in Fig. 4.3, left panel, where the decreasing mass is clearly observed. Figure 4.3, right panel, shows the evolution of proportions of the isotopes after the closure, which corresponds to

$$\mathcal{C}\mathbf{x}(t) = \mathcal{C}\left(\mathbf{x}(0) \oplus (-t \odot \exp[\boldsymbol{\lambda}])\right),\tag{4.6}$$

where exp[**λ**] = (exp(*λ*<sub>1</sub>), exp(*λ*<sub>2</sub>), exp(*λ*<sub>3</sub>)). Figure 4.4 shows the evolution of the isotopes in a ternary diagram. The main point about this exponential decay of isotopes is that it is naturally expressed using perturbation and powering, as in Eq. (4.6). In the simplex, this compositional evolution is linear. If proportions are thought of as real variables, as they are shown in Fig. 4.3 (right panel) or in Fig. 4.4, then their evolution is taken as non-linear, thus ignoring its simplicity as a compositional evolution.
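The equivalence between Eqs. (4.5) and (4.6) can be checked numerically. The sketch below uses the initial masses and decay rates given in the text; the function names are illustrative choices, not part of the original formulation.

```python
import math

def closure(x):
    """Rescale positive parts so that they add up to one."""
    s = sum(x)
    return [v / s for v in x]

def perturb(x, y):
    """Perturbation: closed componentwise product."""
    return closure([a * b for a, b in zip(x, y)])

def power(a, x):
    """Powering: closed componentwise power with real exponent a."""
    return closure([v ** a for v in x])

x0 = [0.9, 0.04, 0.01]    # initial masses x(0)
rates = [3.0, 0.5, 0.1]   # decay rates lambda_i

def masses(t):
    """Eq. (4.5): exponential decay of each mass."""
    return [m * math.exp(-lam * t) for m, lam in zip(x0, rates)]

def proportions(t):
    """Eq. (4.6): x(0) perturbed by the (-t)-power of exp[lambda]."""
    exp_lam = [math.exp(lam) for lam in rates]
    return perturb(x0, power(-t, exp_lam))

# Closing the decayed masses and moving along the compositional line agree.
times = (0.0, 0.5, 2.0)
direct = [closure(masses(t)) for t in times]
via_simplex = [proportions(t) for t in times]
```

In this form the time *t* enters only as the scalar of a powering, which is what makes the trajectory a straight line in the simplex.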

The fact that perturbation is easily interpreted on vectors of proportions supports the idea that the simplex is a suitable sample space for compositions. Think, for instance, how perturbation could be interpreted when taking representatives of compositions as projections on the positive orthant of a hypersphere, or on the surface of a unit hypercube. It is not intuitive at all. Obviously, if the operation that is considered relevant for the stated problem is a rotation, the representation on the hypersphere may be a sensible choice of sample space.

### **4.5 Conditions on Metrics for Compositions**

In many applications a distance between data points is a central issue. Cluster analysis is a typical example of this. Other metric concepts are crucial, like the size of a vector, the norm, or the possibility of performing orthogonal projections. Note that all these metric concepts are used in the omnipresent regression analysis. Compositional data analysis has the same need of introducing metrics, distances, norms and orthogonality. From the early developments by J. Aitchison (1983), a distance between compositions was introduced and developed (Aitchison 1992; Aitchison et al. 2000). Nowadays, that distance between compositions is called Aitchison distance, and the corresponding Euclidean geometry is named Aitchison geometry (Pawlowsky-Glahn and Egozcue 2001).

The need for a distance between compositions can be motivated from the most basic statistics. For instance, concepts as elementary as mean and variance are based on a choice of a distance in the sample space. Following Fréchet (1948) (see also Pawlowsky-Glahn et al. 2015, Chap. 6), mean and variance of a sample can be introduced in a metric space (sample space endowed with a distance). Consider a compositional sample **x**<sub>*i*</sub>, *i* = 1, 2, …, *n*, represented in the *D*-part simplex 𝕊<sup>*D*</sup>. The data matrix **X** has the compositions **x**<sub>*i*</sub> as rows. Suppose that a distance in 𝕊<sup>*D*</sup> is d<sub>*a*</sub>(⋅, ⋅) (this notation corresponds to the Aitchison distance, although here it is used in a generic sense). A first step is to define the variability of the sample with respect to a given composition **z** as

$$\operatorname{Var}[\mathbf{X}, \mathbf{z}] = \frac{1}{n} \sum\_{i=1}^{n} \mathrm{d}\_{a}^{2}(\mathbf{x}\_{i}, \mathbf{z}) \,, \quad \mathbf{z} \in \mathbb{S}^{D} \,. \tag{4.7}$$

The sample mean, called center for compositions, and the total variance are then defined as

$$\text{Cen}[\mathbf{X}] = \operatorname\*{argmin}\_{\mathbf{z} \in \mathbb{S}^D} \{ \text{Var}[\mathbf{X}, \mathbf{z}] \}\,,\tag{4.8}$$

$$\text{totVar}[\mathbf{X}] = \min\_{\mathbf{z} \in \mathbb{S}^D} \{ \text{Var}[\mathbf{X}, \mathbf{z}] \} = \text{Var}[\mathbf{X}, \text{Cen}[\mathbf{X}]] \,. \tag{4.9}$$

Equations (4.7), (4.8) and (4.9) show that elementary statistics like mean and variance depend critically on the distance used in the sample space.

The Aitchison distance can be defined in different ways (see Pawlowsky-Glahn et al. 2015). One of them is

$$\mathrm{d}\_{a}^{2}(\mathbf{x}, \mathbf{y}) = \frac{1}{2D} \sum\_{j=1}^{D} \sum\_{k=1}^{D} \left( \ln \frac{x\_{j}}{x\_{k}} - \ln \frac{y\_{j}}{y\_{k}} \right)^{2},\tag{4.10}$$

where it is worth noting that ln(*x*<sub>*k*</sub>∕*x*<sub>*k*</sub>) = 0. The distance has been subscripted as d<sub>*a*</sub> to emphasize that it is the Aitchison distance. The first observation on the Aitchison distance is that it is scale invariant, as required by Principle 4.2.1. In fact, any multiplicative constant in **x** or **y** cancels out in the log-ratios in Eq. (4.10). After accepting the Aitchison distance as a proper one for compositions, a simple but tedious computation leads to the expression of the sample center

$$\text{Cen}[\mathbf{X}] = \frac{1}{n} \odot \bigoplus\_{i=1}^{n} \mathbf{x}\_i \,,$$

where ⨁ stands for repeated perturbation, similar to a summation for real addition. At first glance, just dropping the circles in the signs *⊕* and *⊙*, this expression is an average where the traditional sum has been changed to perturbation. Thus, the computation of Cen[**X**] consists of computing the geometric mean of the columns of **X** and closing the resulting vector if a representation on the simplex is desired.
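A small Python sketch may help to connect Eqs. (4.7) to (4.10) with the closed geometric mean. The three-part sample is hypothetical, and the function names are introduced here only for illustration.

```python
import math

def closure(x):
    """Rescale positive parts so that they add up to one."""
    s = sum(x)
    return [v / s for v in x]

def aitchison_dist2(x, y):
    """Squared Aitchison distance, double sum of Eq. (4.10)."""
    D = len(x)
    total = 0.0
    for j in range(D):
        for k in range(D):
            total += (math.log(x[j] / x[k]) - math.log(y[j] / y[k])) ** 2
    return total / (2 * D)

def variability(X, z):
    """Eq. (4.7): mean squared distance from the sample to z."""
    return sum(aitchison_dist2(x, z) for x in X) / len(X)

def center(X):
    """Closed vector of the column-wise geometric means."""
    n, D = len(X), len(X[0])
    gm = [math.exp(sum(math.log(x[j]) for x in X) / n) for j in range(D)]
    return closure(gm)

# Hypothetical 3-part sample, for illustration only.
X = [[0.7, 0.2, 0.1], [0.5, 0.3, 0.2], [0.6, 0.1, 0.3]]
cen = center(X)
tot_var = variability(X, cen)   # Eq. (4.9)

# Scale invariance: (7, 2, 1) and (0.7, 0.2, 0.1) are the same composition.
d_scaled = aitchison_dist2([7.0, 2.0, 1.0], X[1])
d_closed = aitchison_dist2(X[0], X[1])

# A trial composition near the center gives a larger variability.
trial = closure([c * f for c, f in zip(cen, [1.1, 0.9, 1.0])])
var_trial = variability(X, trial)
```

Comparing `var_trial` with `tot_var` for a few trial compositions is a quick, informal way to see that the closed geometric mean plays the role of the minimizer in Eq. (4.8).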

An interesting question is which properties of a metric (distance, norm, inner product) for compositions are desirable and intuitive. Our geometric intuition comes from our experience in the Euclidean space ℝ<sup>3</sup>, and we try to translate these observations to a geometry of the simplex. For instance, if we have a rigid object on the table and we move it to another position, say onto the floor, we expect the distances between points of the object to be equal to those observed before the movement. Also, we observe that projecting a segment on the floor (ℝ<sup>2</sup>), perhaps the edge of a roof, produces a segment no longer than the original one. If the points delimiting the segment are expressed in Cartesian coordinates, *x* and *y* on the floor, and *z* vertical or orthogonal to the floor, the projection of the points consists in suppressing the *z*-coordinate. That is, our experience tells us that suppressing coordinates makes the resulting projected distances shorter than or equal to the original ones. Being a little more subtle, we realize that suppressing the *z*-coordinate is a special projection (orthogonal projection), but there are other kinds of projections. For instance, the shadow cast by the edge of the roof on the floor may be longer than the edge itself, depending on the position of the sun. This is because the shadow is not an orthogonal projection unless the floor is tilted orthogonal to the sun rays. These daily experiences with Euclidean geometry may inspire the following properties of the geometry in the simplex, which we take as requirements.

A. **Equidistance on shift**: The distance between two compositions **x**<sub>1</sub> and **x**<sub>2</sub> in 𝕊<sup>*D*</sup> is equal to their distance after a shift **z**, that is

$$\mathrm{d}\_a(\mathbf{x}\_1 \oplus \mathbf{z}, \mathbf{x}\_2 \oplus \mathbf{z}) = \mathrm{d}\_a(\mathbf{x}\_1, \mathbf{x}\_2) \; ; \tag{4.11}$$


Point A is essential for defining sensible elementary statistics as shown in Eqs. (4.8) and (4.9). To show the importance of this property, a subset of water analyses in Bangladesh has been selected. It comes from a survey conducted in the 1990s as a joint effort by the British Geological Survey and the Department of Public Health Engineering of Bangladesh (British Geological Survey 2001a, b). The subset, called hereafter Northern Bangladesh data, includes 13 dissolved ions in Northern Bangladesh (latitude greater than 26° N) and has been selected with the sole purpose of serving as an illustration. This data set was also used in several studies (see Pawlowsky-Glahn et al. 2015 and references therein). Concentrations of As, Fe and P (mg/l) are shown in a ternary diagram (Fig. 4.5). In the left panel they appear close to the border Fe-P due to the small concentrations of As relative to Fe and P. The right panel of Fig. 4.5 shows the same data set after centering it, that is **X** *⊖* Cen[**X**]. Now details are made visible; for instance, the rounding of As to 1 μg/l is now visible in the form of straight bands extending from the Fe vertex. Although the data points look more dispersed in the left panel than in the right one, the total variance is equal in the two representations, as perturbation does not change the total variance; that is, totVar[**X**] = totVar[**X** *⊖* Cen[**X**]]. This points out the inconvenience of using the visual distance (Euclidean distance) in the ternary diagram.
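Both statements, the equidistance on shift of Eq. (4.11) and the invariance of the total variance under centering, can be verified on a toy sample. The sketch below uses hypothetical three-part compositions with a very small first part, loosely mimicking the As, Fe, P situation; it does not use the Northern Bangladesh data.

```python
import math

def closure(x):
    s = sum(x)
    return [v / s for v in x]

def perturb(x, y):
    return closure([a * b for a, b in zip(x, y)])

def perturb_diff(y, x):
    return closure([a / b for a, b in zip(y, x)])

def dist2(x, y):
    """Squared Aitchison distance, Eq. (4.10)."""
    D = len(x)
    return sum(
        (math.log(x[j] / x[k]) - math.log(y[j] / y[k])) ** 2
        for j in range(D) for k in range(D)
    ) / (2 * D)

def center(X):
    """Closed vector of the column-wise geometric means."""
    n, D = len(X), len(X[0])
    return closure([math.exp(sum(math.log(x[j]) for x in X) / n) for j in range(D)])

def tot_var(X):
    """Eq. (4.9): variability of the sample around its center."""
    cen = center(X)
    return sum(dist2(x, cen) for x in X) / len(X)

# Hypothetical sample; the first part is much smaller than the others.
X = [[0.001, 0.6, 0.399], [0.002, 0.3, 0.698], [0.0005, 0.8, 0.1995]]
z = [0.2, 0.5, 0.3]   # an arbitrary shift

# Requirement A: distances are unchanged by a common perturbation.
d_before = dist2(X[0], X[1])
d_after = dist2(perturb(X[0], z), perturb(X[1], z))

# Centering is a perturbation, so the total variance is unchanged.
X_centered = [perturb_diff(x, center(X)) for x in X]
```

After centering, `center(X_centered)` is the barycenter (1/3, 1/3, 1/3) of the ternary diagram, while `tot_var(X_centered)` equals `tot_var(X)`.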

Requirement B is a consequence of point C, and is discussed at the end of this section. Requirement C is a bit technical but is again inspired by real multivariate geometry. Suppose that a sample of *d* real variables has been observed and the corresponding data set is arranged in an (*n*, *d*) matrix. One may be interested in a multiple scatter-plot of each pair of variables, similar to that shown in Fig. 4.6. The fact that the axes of such plots are perpendicular does not surprise anybody. The assumption is that adding a real variable to a previous set is naturally represented by adding a new coordinate on an axis orthogonal to the previous ones.

**Fig. 4.5** Dissolved As, Fe, P data set. Left panel, data expressed in mg/l. Right panel, same data after centering

Requirement C implicitly calls for an orthogonality relation, usually given by an inner product between compositions, namely ⟨**x**, **y**⟩<sub>*a*</sub>, where **x** and **y** are compositions represented in the same simplex, say 𝕊<sup>*D*</sup>. From this inner product, two compositions are orthogonal if they satisfy ⟨**x**, **y**⟩<sub>*a*</sub> = 0. All metric elements can be derived from the inner product. The square-norm (square size) is ‖**x**‖<sup>2</sup><sub>*a*</sub> = ⟨**x**, **x**⟩<sub>*a*</sub>; and the square-distance is d<sup>2</sup><sub>*a*</sub>(**x**, **y**) = ‖**x** *⊖* **y**‖<sup>2</sup><sub>*a*</sub>. A general property of Euclidean spaces (Queysanne 1973) is that there exists an orthonormal basis constituted by *D* − 1 compositions **e**<sub>1</sub>, **e**<sub>2</sub>, …, **e**<sub>*D*−1</sub>. Orthonormal coordinates are then computed as

$$\phi\_k(x\_1, x\_2, \dots, x\_D) = \langle \mathbf{x}, \mathbf{e}\_k \rangle\_a\,, \quad k = 1, 2, \dots, D - 1\,,$$

and, consequently,

$$\left\|\mathbf{x}\right\|\_{a}^{2} = \sum\_{k=1}^{D-1} \phi\_{k}^{2}(x\_1, x\_2, \dots, x\_D) \; .$$

The question is which form the coordinates *φ*<sub>*k*</sub> can take, so that they satisfy requirements A, B, C, and so that they are compatible with perturbation and powering. These latter conditions lead to the following additional requirement.


D. The coordinates in 𝕊<sup>*D*</sup>, *φ*<sub>*k*</sub>, *k* = 1, 2, …, *D* − 1, satisfy

$$
\phi\_k(\mathbf{x} \oplus (a \odot \mathbf{y})) = \phi\_k(\mathbf{x}) + a \cdot \phi\_k(\mathbf{y})\,,\tag{4.12}
$$

for any compositions **x**, **y**, and any real constant *a*.

From requirements A and D, the *φ*<sub>*k*</sub> can be deduced. Consider first a two-part subcomposition of **x**, denoted **x**<sup>(2)</sup>. These subcompositions constitute a Euclidean space of dimension 1, and two-part compositions can be represented by a single coordinate *φ*<sub>1</sub> = *φ*<sub>1</sub>(*x*<sub>1</sub><sup>(2)</sup>, *x*<sub>2</sub><sup>(2)</sup>). This function must be scale invariant and such that it can take all real values. A simple log-ratio, *φ*<sub>1</sub> = *a*<sub>1</sub> ln(*x*<sub>1</sub><sup>(2)</sup>∕*x*<sub>2</sub><sup>(2)</sup>), where *a*<sub>1</sub> is a real constant to be determined, is a possible choice. The ratio argument within the logarithm guarantees scale invariance, and the logarithm allows *φ*<sub>1</sub> to range over all real numbers. The superscripts denoting the number of parts of the subcomposition are superfluous due to the scale invariance property and, from now on, it is assumed that *x*<sub>*i*</sub><sup>(*k*)</sup> = *x*<sub>*i*</sub>, the latter being the value of the *i*-th component in the large composition **x**.

Consider now a 3-part subcomposition **x**<sup>(3)</sup> = (*x*<sub>1</sub>, *x*<sub>2</sub>, *x*<sub>3</sub>) in a 2-dimensional subspace which includes the subcompositions **x**<sup>(2)</sup>, that is (*x*<sub>1</sub><sup>(3)</sup>, *x*<sub>2</sub><sup>(3)</sup>) = (*x*<sub>1</sub>, *x*<sub>2</sub>). The additional dimension corresponds to a new coordinate *φ*<sub>2</sub> in a direction orthogonal to that of *φ*<sub>1</sub>, as proposed by requirement C. Again this coordinate needs to be scale invariant and able to take any real value. A simple choice can be *φ*<sub>2</sub> = *a*<sub>2</sub> ln(*x*<sub>3</sub>∕g<sub>m</sub>(**x**<sup>(2)</sup>)), where g<sub>m</sub> denotes the geometric mean of the arguments. Iterating the reasoning for an increasing number of parts of the subcomposition, the *k*-th coordinate takes the form

$$\phi\_k = a\_k \ln \frac{x\_{k+1}}{\mathrm{g}\_{\mathrm{m}}(\mathbf{x}^{(k)})}\,, \quad k = 1, 2, \dots, D - 1\,.$$

These expressions for the coordinates fulfill conditions A–D.
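As an informal check of conditions A and D, the coordinates can be coded directly. The sketch below assumes the normalizing constants *a*<sub>*k*</sub> = √(*k*∕(*k* + 1)) that appear later in the text; the compositions and the constant `a` are hypothetical.

```python
import math

def closure(x):
    s = sum(x)
    return [v / s for v in x]

def perturb(x, y):
    return closure([p * q for p, q in zip(x, y)])

def power(a, x):
    return closure([v ** a for v in x])

def coords(x):
    """phi_k = sqrt(k/(k+1)) * ln(x_{k+1} / gm(x_1..x_k)), k = 1..D-1."""
    out = []
    for k in range(1, len(x)):
        log_gm = sum(math.log(v) for v in x[:k]) / k   # log geometric mean
        out.append(math.sqrt(k / (k + 1)) * (math.log(x[k]) - log_gm))
    return out

x = [0.2, 0.3, 0.5]
y = [0.6, 0.1, 0.3]
z = [0.1, 0.7, 0.2]
a = -1.7

# Requirement D, Eq. (4.12): coordinates are linear in perturbation/powering.
lhs = coords(perturb(x, power(a, y)))
rhs = [p + a * q for p, q in zip(coords(x), coords(y))]

# Requirement A, Eq. (4.11): coordinate distances are unchanged by a shift z.
def d2(u, v):
    return sum((p - q) ** 2 for p, q in zip(coords(u), coords(v)))

d_before = d2(x, y)
d_after = d2(perturb(x, z), perturb(y, z))
```

The equality of `lhs` and `rhs` follows from the logarithm turning products into sums and powers into multiples, which is why log-ratios of geometric means are compatible with perturbation and powering.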

The inner product in a Euclidean space can be expressed using Cartesian coordinates as

$$
\langle \mathbf{x}, \mathbf{y} \rangle\_a = \sum\_{k=1}^{D-1} \phi\_k \psi\_k \,, \tag{4.13}
$$

where *φ*<sub>*k*</sub> and *ψ*<sub>*k*</sub> are the coordinates of the *D*-part compositions **x** and **y**, respectively. A tedious exercise consists of substituting the expression of the coordinates in Eq. (4.13) and carrying out the sum for values of *a*<sub>*k*</sub> such that all components of **x**, **y** appear in a symmetric way. Up to a multiplicative constant, the result is

$$
\langle \mathbf{x}, \mathbf{y} \rangle\_a = \sum\_{j=1}^D \ln \frac{x\_j}{\mathrm{g}\_{\mathrm{m}}(\mathbf{x})} \cdot \ln \frac{y\_j}{\mathrm{g}\_{\mathrm{m}}(\mathbf{y})}\,, \quad a\_j = \sqrt{\frac{j}{j+1}}\,,
$$

where the *a*<sub>*j*</sub> appear as normalizing constants homogenizing the scale of the different axes. The inner product ⟨**x**, **y**⟩<sub>*a*</sub> is the ordinary inner product of the ℝ<sup>*D*</sup> vectors clr(**x**) and clr(**y**), which are

$$\text{clr}(\mathbf{x}) = \left( \ln \frac{x\_1}{\mathrm{g}\_{\mathrm{m}}(\mathbf{x})}, \ln \frac{x\_2}{\mathrm{g}\_{\mathrm{m}}(\mathbf{x})}, \dots, \ln \frac{x\_D}{\mathrm{g}\_{\mathrm{m}}(\mathbf{x})} \right),$$

and analogously for clr(**y**).

The squared Aitchison distance, expressed in the coordinates *φ*<sub>*k*</sub> of **x** and *ψ*<sub>*k*</sub> of **y**, is the ordinary squared Euclidean distance in ℝ<sup>*D*−1</sup>, which can be compared to the expression using the clr coefficients in ℝ<sup>*D*</sup>:

$$\mathrm{d}\_{a}^{2}(\mathbf{x},\mathbf{y}) = \sum\_{k=1}^{D-1} (\phi\_{k} - \psi\_{k})^{2} = \sum\_{j=1}^{D} \left( \mathrm{clr}\_{j}(\mathbf{x}) - \mathrm{clr}\_{j}(\mathbf{y}) \right)^{2}. \tag{4.14}$$

Requirement B on dominance of distance of a subcomposition is now evident. From the expression of the distance in coordinates (Eq. 4.14, central term), computing distances within a subcomposition consists of removing some non-negative terms from the sum.
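Equation (4.14) and the dominance argument can be illustrated numerically. The following sketch uses two hypothetical four-part compositions and takes the subcomposition formed by their first three parts, so that its coordinates are simply the first two coordinates of the full composition.

```python
import math

def clr(x):
    """Centered log-ratio coefficients: log parts minus their mean."""
    logs = [math.log(v) for v in x]
    m = sum(logs) / len(logs)
    return [v - m for v in logs]

def coords(x):
    """phi_k = sqrt(k/(k+1)) * ln(x_{k+1} / gm(x_1..x_k)), k = 1..D-1."""
    out = []
    for k in range(1, len(x)):
        log_gm = sum(math.log(v) for v in x[:k]) / k
        out.append(math.sqrt(k / (k + 1)) * (math.log(x[k]) - log_gm))
    return out

def sq_euclid(u, v):
    return sum((p - q) ** 2 for p, q in zip(u, v))

x = [0.1, 0.2, 0.3, 0.4]
y = [0.4, 0.1, 0.3, 0.2]

# Eq. (4.14): same squared distance from D-1 coordinates or D clr coefficients.
d2_coords = sq_euclid(coords(x), coords(y))
d2_clr = sq_euclid(clr(x), clr(y))

# Subcomposition of the first three parts: its coordinates are the first two
# coordinates of the full composition, so one non-negative term is dropped.
d2_sub = sq_euclid(coords(x[:3]), coords(y[:3]))
```

No closure of the subcomposition is needed before calling `coords`, because the coordinates are scale invariant.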

Apparently, there are many possible choices for the form of the coordinates *φ*<sub>*k*</sub>, but most of them are discarded by requirements A and D on compatibility with perturbation (Eqs. 4.11, 4.12). For instance, *φ*<sub>*k*</sub> = ln(*x*<sub>*k*+1</sub>∕(*x*<sub>1</sub> + *x*<sub>2</sub> + ··· + *x*<sub>*k*</sub>)), implicitly proposed in Aitchison (1986), Sect. 10.3, does not lead to a distance and coordinate expressions satisfying A and D. The critical point is that amalgamation, or sum of compositional parts, is not a linear operation for compositions.
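A short counterexample makes the failure of amalgamated log-ratios concrete. The compositions below are hypothetical, and `amalg_coords` is a name introduced here for the log-ratio of a part over the sum of the preceding parts.

```python
import math

def closure(x):
    s = sum(x)
    return [v / s for v in x]

def perturb(x, y):
    return closure([p * q for p, q in zip(x, y)])

def amalg_coords(x):
    """ln(x_{k+1} / (x_1 + ... + x_k)), k = 1..D-1."""
    return [math.log(x[k] / sum(x[:k])) for k in range(1, len(x))]

x = [0.2, 0.3, 0.5]
y = [0.6, 0.1, 0.3]

# If requirement D held, these two lists would coincide.
lhs = amalg_coords(perturb(x, y))
rhs = [p + q for p, q in zip(amalg_coords(x), amalg_coords(y))]

# The first coordinate involves no sum of parts, so it is still additive;
# the second one contains x_1 + x_2 and breaks additivity.
gap_first = abs(lhs[0] - rhs[0])
gap_second = abs(lhs[1] - rhs[1])
```

The gap in the second coordinate arises because the sum of products *x*<sub>1</sub>*y*<sub>1</sub> + *x*<sub>2</sub>*y*<sub>2</sub> is not the product of the sums, which is the sense in which amalgamation is not linear under perturbation.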

Figure 4.6 shows the sample of dissolved As, Fe, P previously represented in Fig. 4.5 in ilr-coordinates. These coordinates are the balances

$$\phi\_1 = \sqrt{\frac{2}{3}} \ln \frac{\text{As}}{(\text{Fe} \cdot \text{P})^{(1/2)}}, \quad \phi\_2 = \sqrt{\frac{1}{2}} \ln \frac{\text{Fe}}{\text{P}}.$$

The visual distances between the data points are now the Aitchison distances. The triangles correspond to the original data set. Its center, expressed in coordinates, is the point where the arrow is anchored. A shift (perturbation) is applied in order to center the data set (circles), so that the new center is the origin of coordinates (end of the arrow). Importantly, the distances between data points after shifting (requirement A) are equal to the previous ones. The fact that the axes are drawn orthogonally corresponds exactly to the fact that these coordinates are orthogonal in the Aitchison geometry for compositional data.

The historical way of defining the centered log-ratio transformation and the whole structure was the reverse of the one presented here. The definitions of perturbation, powering and clr can be found in Aitchison (1986), although the Aitchison distance was already introduced in Aitchison (1983) and discussed in Aitchison et al. (2000). The inner product as such, and the corresponding Euclidean space structure (Aitchison geometry), was introduced independently in Pawlowsky-Glahn and Egozcue (2001), and in Billheimer et al. (2001), although there is a previous definition of orthogonal log-contrasts in Aitchison (1986). Orthogonal coordinates were introduced in Egozcue et al. (2003), and in Egozcue and Pawlowsky-Glahn (2005).

### **4.6 Consequences of the Aitchison Geometry in the Sample Space of Compositional Data**

The consequences of the Euclidean character of the Aitchison geometry for compositional data are multiple and relevant. Once the principles and requirements on the sample space are assumed, they act as guidance in most, if not all, statistical models. The main idea is that compositions are advantageously represented as vectors of coordinates, rather than as proportions. Standard operations, sum and multiplication, on appropriate coordinates are equivalent to perturbation and powering on compositions in the simplex. The fact that Aitchison distances, norms and orthogonal projections are transformed into the ordinary Euclidean distances, norms and orthogonal projections opens the door to using, on ilr coordinates, all mathematical and statistical methods designed for real variables. The recommendation of working on coordinates has been formulated as *the principle of working on coordinates* (Mateu-Figueras et al. 2011). The specific exploratory tools for compositional data are examples of the usefulness of ilr coordinates.

Principal component analysis for compositional data (CoDa-PCA) and its graphical representation, the CoDa-biplot, were studied before ilr-coordinates were available (Aitchison 1983; Aitchison and Greenacre 2002), but they are a wonderful example of their usefulness. A *D*-part compositional data set, arranged in an (*n*, *D*)-matrix **X**, is clr-transformed and centered; then, the singular value decomposition is carried out. This can be summarized as

$$\text{clr}(\mathbf{X}\_c) = \text{clr}(\mathbf{X} \ominus \mathbf{1}\_n \text{Cen}[\mathbf{X}]) = \mathbf{U} \Lambda \mathbf{V}^\top,\tag{4.15}$$

where clr is applied to each composition (row) of the centered matrix **X**<sub>*c*</sub>, and **1**<sub>*n*</sub> is a column vector of *n* ones. The diagonal matrix **Λ** contains *D* − 1 singular values ordered from the largest to the smallest. The *D*-th singular value is always null, since the rows of clr(**X**<sub>*c*</sub>) add to zero, and can be removed. The (*D*, *D* − 1)-matrix **V** (loadings matrix), once the last column corresponding to the null singular value is removed, has orthonormal columns and satisfies **V**<sup>⊤</sup>**V** = **I**<sub>*D*−1</sub>, **VV**<sup>⊤</sup> = **I**<sub>*D*</sub> − (1∕*D*)**1**<sub>*D*</sub>**1**<sub>*D*</sub><sup>⊤</sup>. Therefore, it is a contrast matrix like that used to compute ilr-coordinates of a composition (column vector) (Egozcue et al. 2011)

$$\mathbf{z} = \mathrm{ilr}(\mathbf{x}) = \mathbf{V}^{\mathsf{T}} \mathrm{clr}(\mathbf{x}), \quad \mathbf{x} = c \cdot \exp[\mathbf{V} \mathbf{z}] \ .$$

This means that the rows of the (*n*, *D* − 1)-matrix **UΛ** are ilr-coordinates of the centered compositional data set. A form biplot represents simultaneously the rows of **UΛ** (coordinates of the compositions) and the columns of **V**<sup>⊤</sup> (clr unitary vectors in the ilr-basis) in an optimal bi-dimensional projection for visualization.

**Fig. 4.7** Biplots of Northern Bangladesh data set, representing 13 dissolved ions. Left: form biplot showing that the projection is mainly dominated by the clr coefficients of As, Mn, and SO4; up to the projection (65.2% of total variance), Aitchison distances between data points are approximately those visualized. Right: covariance biplot adequate for interpretation. Up to the projection, the lengths of links between vertices of rays are proportional to the standard deviation of the corresponding log-ratio. The lengths of the rays are approximately proportional to the standard deviation of the corresponding clr-coefficients. Variability is largely dominated by the log-ratios of SO4 over As, Fe and Mn

Figure 4.7 shows the form biplot of the Northern Bangladesh data set. Form biplots (Fig. 4.7, left) and scatter-plots of coordinates (Fig. 4.6) can replace plots on ternary diagrams, as distances between compositions are not distorted in an uncontrolled manner. They are only affected by the orthogonal projections.
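The properties of a contrast matrix, and the ilr transformation with its inverse, can be sketched without any linear algebra library. The matrix built below is one particular choice (a Helmert-type matrix matching coordinates of the form √(*k*∕(*k* + 1)) ln(*x*<sub>*k*+1</sub>∕g<sub>m</sub>(*x*<sub>1</sub>, …, *x*<sub>*k*</sub>))), not the loadings matrix obtained from the singular value decomposition; the composition is hypothetical.

```python
import math

def closure(x):
    s = sum(x)
    return [v / s for v in x]

def clr(x):
    logs = [math.log(v) for v in x]
    m = sum(logs) / len(logs)
    return [v - m for v in logs]

def contrast_matrix(D):
    """(D, D-1) matrix; column k holds the clr of the k-th basis element."""
    V = [[0.0] * (D - 1) for _ in range(D)]
    for k in range(1, D):
        a = math.sqrt(k / (k + 1))
        for i in range(k):
            V[i][k - 1] = -a / k    # parts x_1..x_k enter via their geometric mean
        V[k][k - 1] = a             # part x_{k+1} in the numerator
    return V

def ilr(x, V):
    """z = V^T clr(x)."""
    c = clr(x)
    return [sum(V[i][k] * c[i] for i in range(len(c))) for k in range(len(V[0]))]

def ilr_inv(z, V):
    """x = closure(exp(V z))."""
    return closure([math.exp(sum(V[i][k] * z[k] for k in range(len(z))))
                    for i in range(len(V))])

D = 4
V = contrast_matrix(D)

# V^T V should be the identity of order D-1.
VtV = [[sum(V[i][a] * V[i][b] for i in range(D)) for b in range(D - 1)]
       for a in range(D - 1)]
# V V^T should be the identity of order D minus (1/D) times a matrix of ones.
VVt = [[sum(V[i][k] * V[j][k] for k in range(D - 1)) for j in range(D)]
       for i in range(D)]

x = [0.1, 0.2, 0.3, 0.4]
z = ilr(x, V)
x_back = ilr_inv(z, V)
```

Any other matrix with the same two properties would serve equally well; the choice only changes which orthonormal basis the coordinates refer to.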

The ilr coordinates are real variables and their exploratory analysis relies on standard exploratory analysis tools (mean, standard deviation, quantiles, correlations). However, interpretable coordinates are desirable. They can be designed by the analyst to gain insight into some aspects of the data he/she may be interested in. At other times a data-driven technique may be used to design suitable coordinates (Pawlowsky-Glahn et al. 2011; Martín-Fernández et al. 2017). In these cases, the CoDa-dendrogram (Pawlowsky-Glahn and Egozcue 2011) can be useful to summarize properties of the coordinate sample jointly with an interpretable description of the coordinates used. The definition of the coordinates is based on a sequential binary partition (SBP) of the parts of the composition (Egozcue and Pawlowsky-Glahn 2005, 2006). Each coordinate is associated with a partition of a group of parts into two new groups. For instance, Table 4.2 shows this kind of partition for the Northern Bangladesh data set. The second row of Table 4.2 indicates the separation of As (+1) from the group constituted by Fe, Mn and P (−1). This separation is associated with the second ilr coordinate

$$z\_2 = \sqrt{\frac{3}{4}} \ln \frac{\text{As}}{(\text{Fe} \cdot \text{Mn} \cdot \text{P})^{1/3}}\,.$$

**Table 4.2** Sign code for an SBP of the 13 dissolved ions, obtained by clustering variables of the Northern Bangladesh data set

These kinds of coordinates are called balances between two groups of parts (Egozcue and Pawlowsky-Glahn 2005), as they are log-ratios of the geometric means of the elements in each group; the coefficient in front of the logarithm is a normalization coefficient which takes into account the number of elements in each group of parts. Figure 4.8 shows the CoDa-dendrogram for the Northern Bangladesh data set. The tree-dendrogram itself follows the partition in Table 4.2. The lengths of the lines perpendicular to the labels, say vertical lines, are proportional to the variance of the balance separating the groups of elements at the left and right hand sides. These vertical lines are anchored to horizontal segments joining the two groups of parts. All these segments are scaled in such a way that the zero value is placed in the center of the segment, and their length represents the same range in all cases. The fulcrum of the vertical line is placed at the average value of the balance; it can be compared to the median indicated in the box-plot under the horizontal line. In this way, the CoDa-dendrogram combines the interpretation of the balance-coordinates given by the SBP and their mean, variance and quantiles (box-plots).
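A balance can be computed directly from one row of a sign code. The sketch below assumes the usual normalization √(*r*·*s*∕(*r* + *s*)) for groups of *r* and *s* parts, which reproduces the coefficient √(3∕4) of the coordinate *z*<sub>2</sub> above; the concentrations are hypothetical and are not taken from the Northern Bangladesh data.

```python
import math

def balance(x, signs):
    """Balance for one row of an SBP sign code with entries +1, -1 or 0."""
    plus = [v for v, sg in zip(x, signs) if sg > 0]
    minus = [v for v, sg in zip(x, signs) if sg < 0]
    n_plus, n_minus = len(plus), len(minus)
    log_gm_plus = sum(math.log(v) for v in plus) / n_plus
    log_gm_minus = sum(math.log(v) for v in minus) / n_minus
    norm = math.sqrt(n_plus * n_minus / (n_plus + n_minus))
    return norm * (log_gm_plus - log_gm_minus)

# Hypothetical concentrations of (As, Fe, Mn, P) in arbitrary common units.
sample = [0.05, 3.0, 0.5, 1.2]
signs = [+1, -1, -1, -1]   # As against the group Fe, Mn, P

z2 = balance(sample, signs)

# The same value written out as in the formula for z_2.
as_, fe, mn, p = sample
z2_direct = math.sqrt(3 / 4) * math.log(as_ / (fe * mn * p) ** (1 / 3))
```

Parts coded 0 do not take part in the balance, so the same function can be applied row by row to a full sign-code table.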

In Fig. 4.8, the variances within the subcomposition (Zn, Si, Sr, Na, SO4) are small compared to other variances, thus pointing out a possible compositional association between these elements; it suggests that these elements change proportionally along the considered sample. At the same time, most of the total variance is driven by As, Fe, Mn and P, as indicated by longer vertical lines.

**Fig. 4.8** CoDa-dendrogram following the sign code in Table 4.2 obtained by clustering variables of the Northern Bangladesh data set. Vertical bars describe the decomposition of the total variance given in Eq. (4.16). Anchoring points of vertical bars indicate the mean value of the corresponding coordinate

The explanatory power of the CoDa-biplot and the CoDa-dendrogram relies on the fact that they are based on Cartesian coordinates for plotting data-points and that the represented variables are orthonormal in a geometric sense. The key in interpreting the results is the decomposition of the total variance of the data set into variances of the ilr-coordinates (Egozcue and Pawlowsky-Glahn 2011a)

$$\text{totVar}[\mathbf{X}] = \sum\_{k=1}^{D-1} \text{Var}[\phi\_k] \,. \tag{4.16}$$
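The decomposition in Eq. (4.16) can be checked on a toy data set. The sketch below uses a hypothetical four-part sample, variances with divisor *n* (in line with the definition of variability as a mean squared distance), and the fact that the clr of the center is the arithmetic mean of the clr-transformed rows.

```python
import math

def clr(x):
    logs = [math.log(v) for v in x]
    m = sum(logs) / len(logs)
    return [v - m for v in logs]

def coords(x):
    """phi_k = sqrt(k/(k+1)) * ln(x_{k+1} / gm(x_1..x_k)), k = 1..D-1."""
    out = []
    for k in range(1, len(x)):
        log_gm = sum(math.log(v) for v in x[:k]) / k
        out.append(math.sqrt(k / (k + 1)) * (math.log(x[k]) - log_gm))
    return out

def variance(values):
    """Variance with divisor n."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Hypothetical 4-part sample, for illustration only.
X = [[0.10, 0.20, 0.30, 0.40],
     [0.25, 0.25, 0.10, 0.40],
     [0.05, 0.50, 0.15, 0.30],
     [0.30, 0.10, 0.20, 0.40]]
n, D = len(X), len(X[0])

# Total variance: mean squared Aitchison distance to the center,
# computed here through clr coefficients.
clr_rows = [clr(x) for x in X]
clr_center = [sum(r[j] for r in clr_rows) / n for j in range(D)]
tot_var = sum(sum((r[j] - clr_center[j]) ** 2 for j in range(D))
              for r in clr_rows) / n

# Variances of each of the D-1 orthonormal coordinates.
coord_rows = [coords(x) for x in X]
coord_vars = [variance([row[k] for row in coord_rows]) for k in range(D - 1)]
```

The sum of `coord_vars` reproduces `tot_var`, and the individual terms are the quantities that the vertical bars of a CoDa-dendrogram are meant to display.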

### **4.7 Conclusions**

The first step in any data modelling is to establish a sample space able to give answers to the questions stated by the analyst. If these questions involve probabilistic statements, the sample space needs a sigma field of events for which probabilities can be defined. However, most analysts search for statements implying operations, distances, and projections between data points or variables. All these concepts need to be defined in the sample space for useful computations and interpretations. These definitions are not intrinsic, but are adapted to the questions stated by the analyst in a subjective way. Therefore, the choice of a sample space always has a subjective character, which is only validated by its ability to give useful answers to sound questions.

Compositional data require defining a sample space with a rich structure. The log-ratio approach to the analysis of compositional data is based on a set of principles and conditions. The approach here presented is a modification of the standard principles introduced by J. Aitchison in the eighties and reformulated afterwards. Scale invariance and compositional equivalence are maintained exactly as they were introduced, but additional conditions are to be discussed in relation to perturbation, which is assumed to be the main operation between compositions. The Euclidean structure of compositional data represented in the simplex, called Aitchison geometry, is here motivated using the idea that reduction to a subcomposition should be an orthogonal projection.

The Aitchison geometry is thought of as a powerful mathematical tool which consistently completes the previous Aitchisonian ideas on the log-ratio approach. The main points are the conception of compositions as equivalence classes (Barceló-Vidal and Martín-Fernández 2016), thus overcoming the early definitions based on the constant sum constraint; and the introduction of coordinates in the Aitchison geometry (Pawlowsky-Glahn and Egozcue 2001; Egozcue et al. 2003; Egozcue and Pawlowsky-Glahn 2005), thus overcoming the idea that taking log-ratios is just a transformation which circumvents the constant sum constraint.

**Acknowledgements** The research on compositional data analysis has been continuously supported by the *International Association for Mathematical Geosciences* (formerly *for Mathematical Geology*), *IAMG*. The authors appreciate this support, in many cases unconditional, and recognize that the development of the compositional data analysis would have been a lot harder without it. This research has been funded by the Spanish Ministry of Economy and Competitiveness under the project CODA-RETOS/TRANS-CODA (Ref: MTM2015-65016-C2-1/2-R).

### **References**



### **Chapter 5 Properties of Sums of Geological Random Variables**

### **G. M. Kaufman**

"All models are wrong. Some are useful" George E. P. Box.

**Abstract** In the absence of empirical data that allow resolution of the vexing problem of how to address probabilistic dependencies among and between elements of large sets of geologic random variables, we need methods that refocus and streamline expert geological judgment inputs, along with analytical methods for modeling dependencies that go beyond pairwise correlation and its cousins. Some possibilities are reviewed.

### **5.1 Introduction**

Suppose that you are given the marginal distribution of each of a set of n random variables but no other information. What can be said about the behavior of their sum? This is an old problem, extensively studied by probability theorists and statisticians (Hoeffding 1940; Fréchet 1951). There is a rich probabilistic finance and actuarial risk analysis literature devoted to calculation of bounds on sums of random variables. This question motivates our review of state-of-the-art methods designed to reduce geologists' cognitive load when they are asked to assign judgmental probabilities to uncertain geologic variables.

In a wide range of settings geologists are asked to provide personal probability judgments about a collection of uncertain quantities and, in particular, about sums of them. Probabilistic assessments of oil and gas in unexplored petroleum plays and basins are recurring examples. In the absence of hard data they deal rather well with the cognitive task of providing personal judgments about marginal distributions of geologic attributes; i.e. their assessments are, in the large, reasonably well calibrated. Geologists' personal judgments about dependencies among uncertain geologic quantities are more problematic.

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_5

G. M. Kaufman (✉)

Management Emeritus, E62-437, Sloan School of Management MIT, 50 Memorial Drive, Cambridge, MA 02142, USA e-mail: gkaufman@mit.edu

It is worthwhile to distinguish micro-assessments (assessment of dependencies among individual reservoir attributes, for example) from macro-assessments (assessment of dependencies among assessment units, each of which may be a collection of anomalies, reservoirs and fields). Measurable data bearing directly on probabilistic dependencies at the micro-assessment level are often available, but precise measurable data bearing on dependencies among elements in a macro-assessment are seldom available. Chen et al. (2012) point out that

Although efforts have been made to address variable dependence in both methodology and tool development, the greatest emphasis and attention have been given to resource aggregation. Until now, the impact of interdependencies among variables in volumetric resource calculations has been mostly ignored, and the implementation of variable dependency remains a challenge to petroleum resource appraisal. In practice, inadequate data commonly exist to either specify a standard multivariate distribution with an appropriate correlation structure or to quantify the resource aggregation correlation matrices. However, variable correlations are so common among geologic variables that ignoring their interdependence may lead to serious bias, affecting both the resulting resource potential estimation.

Most geologists with some training and experience in probability assessment can provide reasonable responses to questions about marginal distributions of individual attributes of a target entity. Few if any are well equipped to provide sharp coherent judgments about possible dependencies among them. Some progress has been made in understanding how to elicit sensible, coherent judgements about second order co-variability of petroleum assessment units; the recent USGS study of CO2 sequestration in depleted oil and gas reservoirs is an example. However, specification of marginal distributions along with second order moments is not sufficient for identification of a joint distribution of a set of uncertain quantities. This matters when interest centers on the right tail of a sum of magnitudes of petroleum in assessment units. Excepting special cases (joint lognormality, for example), the right tail of a sum of jointly dependent uncertain quantities can, both in principle and in practice, differ meaningfully from the right tail of an approximation based on marginal distributions and second moment properties alone. Lillestøl and Sinding-Larsen's (2017) study of giant field probabilities based on 182 North Sea discoveries highlights the importance of accurate modeling of tail probabilities. For economists, bureaucrats and politicians right tail probabilities are often the most interesting feature of a probabilistic oil and gas assessment. What, for example, is the probability of finding at least one more giant field in a given mature petroleum province?

Objectives here are, first, to outline how methods currently used by geologists to impute probabilistic dependencies among uncertain geologic quantities fit (or don't fit) into a conceptual framework developed by probabilists to answer the question posed at the outset and, second, to review how the probability distribution of a sum of such quantities can be bounded given knowledge of marginal distributions alone, assuming they are governed by a type of functional dependency called co-monotonicity. Co-monotonicity and copulas are conceptual twins.

Section 5.2 lays out necessary theory and definitions and calls attention to co-monotonic upper bounds on sums of random variables and lower bounds expressed in terms of conditional expectations. Section 5.3 addresses geologic case studies, in two of which geologists compute a probability distribution of a sum of random geologic magnitudes in three steps: first, specify marginal distributions of each magnitude; second, elicit judgmental appraisals of pairwise correlations among magnitudes; and third, combine the two using Monte Carlo simulation to arrive at a distribution of the sum. This approach might be labelled "incomplete specification" (not to be confused with the econometric definitions of just-, over- and under-specification). Iman and Conover's (1982) ingenious method for imputing dependencies among a set of random variables, which requires only marginal distributions and pairwise correlations among elements of that set, is deployed in the CO2 sequestration study cited above (Sect. 5.3.2). Chen et al.'s (2012) use of copulas to capture probabilistic dependencies in geologic micro-assessments is reviewed in Sect. 5.3.3. Brief concluding remarks appear in Sect. 5.4. Blondes et al. (2013a, b) offer a sensible rationale for careful attention to dependencies:

In the Circum-Arctic aggregation of the 48 AUs, the 90-percent uncertainty interval for recoverable gas is 1,471, 2,009, or 3,515 tcf for assumptions of independence, assessor specified dependency (correlation), or total dependence respectively. Clearly, decision makers who rely on assessment results need accurate interval projections. Too broad an interval provides little information; too narrow an interval gives a false sense of precision.

Spatial modeling provides important insights into the structure of probabilistic dependencies among petroleum play attributes and deserves careful attention in parallel with methods and models discussed here. It is a topic for another day.

### **5.2 Preliminaries**

Define $F_X$ to be the distribution function of a random vector $\mathbf{X} = (X_1, \ldots, X_n)^t$ with domain $\mathbf{R}^n$ and marginal distributions $F_i$, $i = 1, \ldots, n$. Set $F_X(\mathbf{x}) = \operatorname{Prob}\{X_1 \le x_1, \ldots, X_n \le x_n\}$. Assume that each $F_i$ is continuous and possesses a one to one inverse. Define the $p$th fractile of $X_i$ as the value $x_i(p)$ in the domain of $X_i$ such that $\operatorname{Prob}\{X_i \le x_i(p)\} = p$, so that $F_i^{-1}(p) = x_i(p)$. In turn the $p$th fractile of the sum $S_n = X_1 + \cdots + X_n$ is $s_p$ such that $\operatorname{Prob}\{S_n \le s_p\} = p$ or $F_{S_n}^{-1}(p) = s_p$.

What conditions guarantee that fractiles are strictly additive, that is, that $s_p = x_1(p) + \cdots + x_n(p)$ for all $p \in (0, 1)$? Imposition of functional dependencies among $X_1, \ldots, X_n$ is one route to sufficient conditions for this to be true. To simplify, suppose that $X_1, \ldots, X_n$ share a common domain $D_X$ and consider $n - 1$ continuous, strictly increasing (hence invertible) functions $h_2, \ldots, h_n$, each with domain $D_X$. Suppose that $x_i = h_i(x_1)$ for all $x_1 \in D_X$, $i = 2, \ldots, n$. Then $\operatorname{Prob}\{S_n < s\} = \operatorname{Prob}\{X_1 + h_2(X_1) + \cdots + h_n(X_1) < s\}$. The omnibus function $g(x_1) = x_1 + h_2(x_1) + \cdots + h_n(x_1)$, $x_1 \in D_X$, is continuous and strictly increasing, hence invertible, so $\operatorname{Prob}\{g(X_1) < s\} = \operatorname{Prob}\{X_1 < g^{-1}(s)\}$. The $p$th fractile of $S_n$ is $s_p$ such that $\operatorname{Prob}\{g(X_1) < s_p\} = p$ or $\operatorname{Prob}\{X_1 < g^{-1}(s_p)\} = p$, leading to $x_1(p) = g^{-1}(s_p)$. Equivalently $g(x_1(p)) = s_p$. Because each $h_i$ is increasing, $h_i(x_1(p)) = x_i(p)$, so that $s_p = x_1(p) + \cdots + x_n(p)$. Functional dependencies of this type are too strong to survive the rigors of modeling most real world data. In the absence of complete knowledge of a joint distribution, co-monotonicity is a more flexible approach to modeling joint behavior of dependent random variables.

**Definition** The random vector $\mathbf{X} = (X_1, \ldots, X_n)^t$ is co-monotonic if and only if $(X_1, \ldots, X_n) =_d (F_1^{-1}(U), \ldots, F_n^{-1}(U))$, $U$ a uniform random variable with domain (0, 1).

Here $=_d$ means agreement in distribution. Intuitively each element of a co-monotonic random vector is a function of a single random variable $U$, so all elements of **X** exhibit strong positive dependency. McNeil et al. (2005) provide a more general definition: **X** is co-monotonic if and only if it agrees in distribution with a random vector, each of whose components is a non-decreasing function of a single random variable. If elements of **X** are co-monotonic, increasing one element of **X** increases (or at least does not decrease) all others. Goovaerts et al. (2000) provide a clear, readable account of properties of sums of co-monotonic random variables in an actuarial context. Deelstra et al. (2009) offer a literature review of co-monotonicity in financial economics.
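The additivity of fractiles under co-monotonicity can be illustrated with a small simulation. The following is a minimal sketch, not part of the original study: the lognormal marginals and their parameters (`MU`, `SIGMA`) and the helper names are assumptions chosen purely for illustration. It uses the fact that for a lognormal marginal, $F_i^{-1}(U) = \exp(\mu_i + \sigma_i Z)$ with $Z = \Phi^{-1}(U)$, so a single standard normal draw drives every component.

```python
import numpy as np
from statistics import NormalDist

# Hypothetical lognormal marginals (log-scale mean and sd); illustrative only.
MU = np.array([0.0, 0.5, 1.0])
SIGMA = np.array([0.4, 0.8, 1.2])


def comonotonic_sample(n_draws, seed=0):
    """Draw (F_1^{-1}(U), ..., F_n^{-1}(U)) for the lognormal marginals.

    With Z = Phi^{-1}(U) standard normal, F_i^{-1}(U) = exp(mu_i + sigma_i * Z),
    so one shared Z drives every component.
    """
    z = np.random.default_rng(seed).standard_normal(n_draws)
    return np.exp(MU + SIGMA * z[:, None])  # shape (n_draws, 3)


def marginal_fractile(p):
    """Exact p-th fractiles F_i^{-1}(p) of the lognormal marginals."""
    z_p = NormalDist().inv_cdf(p)
    return np.exp(MU + SIGMA * z_p)


x = comonotonic_sample(100_000)
s = x.sum(axis=1)
for p in (0.05, 0.50, 0.95):
    sum_of_fractiles = marginal_fractile(p).sum()
    fractile_of_sum = np.quantile(s, p)
    print(f"p={p:.2f}  sum of fractiles={sum_of_fractiles:8.3f}  "
          f"fractile of sum={fractile_of_sum:8.3f}")
```

Up to Monte Carlo error the two columns should agree; drawing each component from its own independent $Z$ instead would break the agreement in the tails.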

Geologists may object that in their setting some elements of **X** may be independent or, more rarely, negatively dependent. Even so, co-monotonicity and its consequences provide upper and lower bounds on a sum of random variables with specified marginal distributions that embrace a wide range of dependence structures. When these bounds are judged to be tight enough, reasonable projections of probability distributions of aggregates can be made using marginal distributions along with specification of certain conditional expectations (see (5.1) and (5.5)). The bounds provide useful information about projections based on dependency information elicited from geologists, and they police the reasonableness of probabilistic projections of uncertain geologic resources made using other methods.

### *5.2.1 Bounds*

A random variable $X$ precedes a random variable $Y$ in convex order, denoted by $X \le_{cx} Y$, if and only if $E(g(X)) \le E(g(Y))$ for all real convex functions $g$ for which the expectations are finite. Kaas et al. (2009) use convex order to show that fractiles of co-monotonic random variables can be added in the following sense: for any random vector $\mathbf{X} = (X_1, \ldots, X_n)$ possessing marginal cumulative distribution functions $F_1, \ldots, F_n$ and $U$ a uniform (0, 1) random variable,

$$(X_1 + \dots + X_n) \le_{cx} S_u \equiv F_1^{-1}(U) + \dots + F_n^{-1}(U). \tag{5.1}$$

If $S_u =_d F_1^{-1}(U) + \cdots + F_n^{-1}(U)$ it follows immediately that the $p$th fractile of $S_u$ is $F_{S_u}^{-1}(p) = F_1^{-1}(p) + \cdots + F_n^{-1}(p)$ for all $p \in (0, 1)$. They point out that (5.1) is a supremum in terms of convex order and is a best bound for marginal distributions in a Fréchet space. It is well known that if a random vector $\mathbf{X}$ with marginal distributions $F_1, \ldots, F_n$ belongs to a Fréchet space, the joint cumulative distribution function $\operatorname{Prob}\{X_1 \le x_1, \ldots, X_n \le x_n\}$ of $\mathbf{X}$ is bounded from above by $M_n \equiv \min\{F_1(x_1), \ldots, F_n(x_n)\}$. Goovaerts et al. note that $M_n$ is reachable in that space.
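The convex-order bound (5.1) can be checked numerically on a sample. The sketch below is illustrative only: the three marginals are hypothetical, and the stop-loss function $E[(S - d)_+]$ is used as one convenient family of convex test functions. Sorting every column of a sample produces the co-monotonic coupling of the same empirical marginals, so the mean of the sum is unchanged while its variance and stop-loss values can only grow.

```python
import numpy as np


def stop_loss(s, d):
    """Empirical stop-loss premium E[(S - d)+], a convex function of S."""
    return np.maximum(s - d, 0.0).mean()


rng = np.random.default_rng(42)
n_draws = 50_000

# Hypothetical marginals for three assessment units (illustrative only).
cols = np.column_stack([
    rng.lognormal(mean=0.0, sigma=1.0, size=n_draws),
    rng.gamma(shape=2.0, scale=1.5, size=n_draws),
    rng.uniform(0.0, 4.0, size=n_draws),
])

# Independent coupling: columns as drawn. Co-monotonic coupling: sort every
# column so that all components move together; the marginals are unchanged.
s_indep = cols.sum(axis=1)
s_como = np.sort(cols, axis=0).sum(axis=1)

print("means    :", s_indep.mean(), s_como.mean())
print("variances:", s_indep.var(), s_como.var())
for d in (5.0, 10.0, 15.0):
    print(f"E[(S-{d})+]  indep={stop_loss(s_indep, d):.4f}  "
          f"como={stop_loss(s_como, d):.4f}")
```

The same comparison can be run with any other dependence structure in place of independence; the co-monotonic column should never be smaller.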

For sums of elements of **X**, introduction of a random variable $Z$ such that distribution functions of each $X_i$ given $Z$ are known with certainty leads to refined upper and lower bounds. In a geologic context $Z$ is interpretable as a latent (background) variable describing gross geologic characteristics of, for example, a petroleum assessment unit. The conditioning variable $Z$ might be regression dependent on geologic attributes of an assessment unit and need not be scalar. These authors define $F_{X_i|Z}^{-1}(U)$ to be a random variable $f_i(U, Z)$ that for $(U, Z) = (u, z)$ assumes value $F_{X_i|z}^{-1}(u)$ and prove that for $U$ uniform (0, 1) and $Z$ independent of $U$

$$(X_1 + \dots + X_n) \le_{cx} S_u^{*} \equiv F_{X_1|Z}^{-1}(U) + \dots + F_{X_n|Z}^{-1}(U). \tag{5.2}$$

Jensen's inequality leads to a lower bound

$$E(X_1|Z) + \dots + E(X_n|Z) \le_{cx} (X_1 + \dots + X_n). \tag{5.3}$$

Kaas et al. (2009) point out that (a) the random vector $(E(X_1|Z), \ldots, E(X_n|Z))$ will not in general have marginal distributions $F_1, \ldots, F_n$; (b) if $E(X_1|Z), \ldots, E(X_n|Z)$ are either all non-increasing or all non-decreasing functions of $Z$, the LHS in (5.3) is a sum of co-monotonic random variables; and (c) by the law of total variance, $\operatorname{Var}(E(X_i|Z)) < \operatorname{Var}(X_i)$ unless $E(\operatorname{Var}(X_i|Z)) = 0$. In order to create a path to direct computation of the cdf of the LHS of (5.3), suppose that (b) obtains and that each of the random variables $E(X_1|Z), \ldots, E(X_n|Z)$ is a non-decreasing function of $z$ when $Z = z$. Write the lower bound as $E(X_1|Z) + \cdots + E(X_n|Z) = E(S|Z)$ and define $F_{E(X_i|Z)}(x) = \operatorname{Prob}\{E(X_i|Z) \le x\}$. They show that, provided that the cdf of each $E(X_i|Z)$ is continuous and increasing,

$$F_{E(X_1|Z)}^{-1}(F_{E(S|Z)}(x)) + \dots + F_{E(X_n|Z)}^{-1}(F_{E(S|Z)}(x)) = x, \tag{5.4}$$

a prescription for calculating a lower bound. The quality of the lower bound (5.3) depends of course on the choice of a model for $Z$. Kaas et al. (2002) and Goovaerts et al. (2000) demonstrate that upper and lower bounds (5.1) and (5.3) provide reasonable bounds on the cumulative distribution function of certain sums of discounted cash flows as well as on the cumulative distribution function of sums of dependent lognormal random variables. Lux and Papapantoleon (2017) show that upper and lower Fréchet–Hoeffding bounds such as those described above can be tightened. They demonstrate that other types of information, knowledge of functionals of lower dimensional marginals of an n-dimensional copula for example, also lead to improvements. The tradeoff is that the improved bounds are quasi-copulas but not copulas.
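For sums of dependent lognormal variables the lower bound (5.3) has a closed form, which makes a numerical comparison with the co-monotonic upper bound (5.1) straightforward. The sketch below is a minimal illustration under stated assumptions: the log-magnitudes are taken to be multivariate normal with hypothetical parameters, and the conditioning variable is chosen as $Z = Y_1 + \cdots + Y_n$, the sum of the log-magnitudes. This is one convenient choice, not a recommendation from the studies cited.

```python
import numpy as np

# Hypothetical inputs: log-magnitudes Y are multivariate normal (illustrative only).
mu = np.array([0.0, 0.2, 0.4])
sigma = np.array([0.5, 0.5, 0.5])
rho = 0.3
corr = np.full((3, 3), rho) + (1.0 - rho) * np.eye(3)
cov = corr * np.outer(sigma, sigma)

rng = np.random.default_rng(7)
y = rng.multivariate_normal(mu, cov, size=200_000)
x = np.exp(y)
s = x.sum(axis=1)                         # the actual sum S

# Conditioning variable (an assumption of this sketch): Z = Y_1 + ... + Y_n.
z = y.sum(axis=1)
var_z = cov.sum()
cov_yz = cov.sum(axis=1)                  # Cov(Y_i, Z)
beta = cov_yz / var_z
cond_var = sigma**2 - cov_yz**2 / var_z   # Var(Y_i | Z)

# E(X_i | Z) for lognormal X_i: exp(E(Y_i | Z) + Var(Y_i | Z) / 2).
cond_mean_y = mu + beta * (z[:, None] - mu.sum())
lower = np.exp(cond_mean_y + 0.5 * cond_var).sum(axis=1)   # LHS of (5.3)

# Co-monotonic upper bound (5.1): sort each column, then add.
upper = np.sort(x, axis=0).sum(axis=1)

for name, v in (("lower", lower), ("sum", s), ("upper", upper)):
    print(f"{name:5s} mean={v.mean():.3f} var={v.var():.3f} "
          f"P95={np.quantile(v, 0.95):.3f}")
```

All three sums share the same mean, while the variances should be ordered lower, actual, upper. Because every `beta` is positive here, each $E(X_i|Z)$ is increasing in $Z$, so the terms of the lower bound are themselves co-monotonic and its fractiles add as in (5.4).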

Comparison of predictive distributions of undiscovered mineral resources derived by conventional methods currently in use with co-monotonic bounds on them is a promising avenue of research.

### **5.3 Thumbnail Case Studies**

Thumbnail sketches of three case studies serve as a template for discussion of the probabilistic dependence issues raised above: examples of the USGS approach to probabilistic dependencies among oil and gas assessment units, the USGS probabilistic assessment of CO2 sequestration in mature oil and gas reservoirs in the United States, and a Geological Survey of Canada study of the use of copulas to capture probabilistic dependencies among accumulations in individual oil and gas plays.

### *5.3.1 USGS Oil and Gas Resource Projections*

The USGS developed an assessment system in the 1980s with the acronym FASP (fast appraisal system for petroleum resources). FASP incorporated perfect positive correlation between micro-level reservoir attributes but allowed specification of any positive correlation in the course of aggregating play resources. However, the USGS 2000 World Petroleum Assessment aggregates undiscovered resource volumes from assessment unit level to regional level using perfect correlation as the argument for adding assessment unit fractiles to arrive at regional level aggregates. Recognizing that at the global level large regional aggregates of resources are unlikely to be perfectly correlated, the assessment adopts a pairwise correlation of 0.5 between pairs of eight regions (Klett et al. 2000). No sensitivity analysis of how aggregate projections vary with these particular choices is provided.

Many USGS assessment studies present tables of fractiles of individual assessment units and then add them to arrive at a fractile assessment of total resources. Addition is qualified by the statement that "Fractiles are additive under assumption of perfect positive correlation" allowing avoidance of direct assessment of dependencies among units. Table 2 in "Assessment of Undiscovered Continuous Oil and Gas Resources in the Monterey Formation, San Joaquin Basin Province, California" USGS Fact Sheet 2015-3058 September 2015 and Table 2 in USGS Fact Sheet 2014–3082 "Assessment of Potential Shale-Oil and Shale-Gas Resources in Silurian shales of Jordan" September 2014 are examples. Chen et al. (2012) cite additional examples (Klett et al. 2000, 2005; Klett 2004). It is easy to show that "perfect correlation" is not robust to variations in specification of the functional form of marginal distributions elicited from geologists. Worse, addition of fractiles without careful attention to properties of the joint distribution of a set of uncertain quantities can lead to incoherence. On the other hand mutual independence allows specification of arbitrary marginal probability distributions without doing violence to coherence but often leads to an unacceptably narrow probability projection of sums of oil and gas magnitudes.

A salient feature of Pearson's correlation coefficient is that random variables $X$ and $Y$ possess correlation 1.0 or −1.0 only if $X$ and $Y$ are linearly dependent. As Denuit and Dhaene (2003) point out, a limiting case is a bivariate normal pair of random variables for which the variance of one member of the pair is zero. If $X$ and $Y$ are jointly lognormal and $\log X$ is a linear function of $\log Y$, the Pearson correlation of $\log X$ and $\log Y$ is either 1.0 or −1.0. However, unless $X$ is itself a linear function of $Y$, the Pearson correlation of $X$ and $Y$ is then less than 1.0 in absolute value. Denuit and Dhaene provide a more nuanced treatment. Suppose $F_1$ and $F_2$ are marginal cumulative distribution functions of $X$ and $Y$ respectively, each concentrated on $(0, \infty)$, and $U$ is a uniform random variable independent of $X$ and $Y$. Using super-modularity these authors prove that if $F_1$ and $F_2$ lie in a Fréchet space the Pearson correlation coefficient $r(X, Y)$ of $X$ and $Y$ is bounded by

$$\frac{\operatorname{Cov}(F_1^{-1}(U), F_2^{-1}(1-U))}{\sqrt{\operatorname{Var}(X)}\sqrt{\operatorname{Var}(Y)}} \le r(X, Y) \le \frac{\operatorname{Cov}(F_1^{-1}(U), F_2^{-1}(U))}{\sqrt{\operatorname{Var}(X)}\sqrt{\operatorname{Var}(Y)}}. \tag{5.5}$$

In this setting perfect correlation need not be achievable. They also prove that it is possible for a pair of co-monotonic lognormal random variables to have pairwise correlation close to zero, contradicting the intuitive notion that small correlation implies weak dependence. Denuit and Dhaene call attention to Shih and Huang's (1992) and Schechtman and Yitzhaki's (1999) observation that, for any two random variables, the achievable range of Pearson's correlation coefficient is [−1, 1] only if the functional forms of the two marginal distributions differ solely in values of location and/or scale parameters. If not, the range of Pearson's r is narrower than [−1, 1] and depends on the shape of the two marginal distributions.
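For lognormal marginals the bounds in (5.5) can be evaluated in closed form, which makes the point about near-zero correlation of co-monotonic pairs concrete. The sketch below uses the standard lognormal moment formulas; the function name and the example parameter values are illustrative assumptions. With $\sigma_1$ and $\sigma_2$ the standard deviations of $\log X$ and $\log Y$, the upper bound is $(e^{\sigma_1 \sigma_2} - 1) / \sqrt{(e^{\sigma_1^2} - 1)(e^{\sigma_2^2} - 1)}$ and the lower bound replaces $\sigma_1 \sigma_2$ by $-\sigma_1 \sigma_2$ in the numerator.

```python
import math


def lognormal_corr_bounds(sigma1, sigma2):
    """Attainable Pearson correlation range for two lognormal variables.

    sigma1, sigma2 are the standard deviations of log X and log Y. The upper
    bound is reached by the co-monotonic pair (F1^{-1}(U), F2^{-1}(U)) and the
    lower bound by the counter-monotonic pair (F1^{-1}(U), F2^{-1}(1 - U)).
    """
    denom = math.sqrt(math.expm1(sigma1**2) * math.expm1(sigma2**2))
    r_min = math.expm1(-sigma1 * sigma2) / denom
    r_max = math.expm1(sigma1 * sigma2) / denom
    return r_min, r_max


for s1, s2 in ((0.5, 0.5), (1.0, 2.0), (1.0, 4.0)):
    r_lo, r_hi = lognormal_corr_bounds(s1, s2)
    print(f"sigma=({s1}, {s2}):  {r_lo:+.5f} <= r(X, Y) <= {r_hi:+.5f}")
```

When the two log-scale standard deviations are equal the upper bound is 1, but as they diverge the whole attainable interval collapses toward zero even though the co-monotonic pair is as strongly dependent as the marginals allow.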

These authors document several important features of Kendall's *τ* and Spearman's *ρ*. (Spearman's *ρ* is at the center of the Iman and Conover method deployed in the USGS (2013) study of CO2 sequestration to compute predictive probability distributions of aggregates.) First, both are invariant with respect to strictly increasing transformations. Second, when one variable is a non-decreasing (non-increasing) transformation of the other they equal 1 (resp. −1), the value attained at the Fréchet upper (resp. lower) bound. According to them Kendall's *τ* and Spearman's *ρ* are more desirable measures of association for non-normal multivariate distributions than Pearson's *r* because the latter does not share their invariance properties. These invariance properties come into play in Iman and Conover's method discussed below. Denuit and Dhaene prove the non-obvious fact that if positively or negatively quadrant dependent random couples are uncorrelated they are independent.
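The invariance property is easy to see numerically. The following minimal sketch, with hypothetical data and helper names, computes Pearson's *r*, Spearman's *ρ* and Kendall's *τ* before and after strictly increasing transformations of each variable; only the rank-based measures should be unchanged.

```python
import numpy as np


def ranks(v):
    """Ranks 0..n-1 of a 1-D array (assumes no ties)."""
    return np.argsort(np.argsort(v))


def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]


def spearman(a, b):
    """Spearman's rho: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def kendall_tau(a, b):
    """Kendall's tau from all pairs (O(n^2); fine for small samples without ties)."""
    da = np.sign(a[:, None] - a[None, :])
    db = np.sign(b[:, None] - b[None, :])
    n = len(a)
    return (da * db).sum() / (n * (n - 1))


rng = np.random.default_rng(3)
z = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=1_000)
u, v = z[:, 0], z[:, 1]
x, y = np.exp(u), np.exp(3.0 * v)   # strictly increasing transformations

print("Pearson :", pearson(u, v), "->", pearson(x, y))
print("Spearman:", spearman(u, v), "->", spearman(x, y))
print("Kendall :", kendall_tau(u, v), "->", kendall_tau(x, y))
```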

All of this emphasizes that "perfect correlation" as an omnibus argument for adding fractiles has many pitfalls. Co-monotonic bounds on random sums are a conceptually satisfactory alternative that deserves much future study.

### *5.3.2 USGS Probabilistic Assessment of CO2 Storage Capacity*

A recent USGS probabilistic assessment of CO2 sequestration in mature petroleum reservoirs (Blondes et al. 2013a, b) is based on both micro- and macro-assessments by geologists. Their macro-assessment aggregates storage assessment units (SAUs) at basin, regional and national levels. An objective was to provide probabilistic assessments that take into account dependencies among assessment units arising from "overlap of geologic analogs, assessment methods and assessors" using individual SAU marginal probability distributions and "…a correlation matrix obtained by expert elicitation describing interdependencies between pairs of SAUs". The correlation matrix dimension is 192 × 192. Because a menagerie of marginal distributions (Beta-PERT, lognormal, truncated lognormal) was deployed at the micro-level, use of standard multivariate distribution theory is not appropriate. Dependencies among storage capacity magnitudes are induced using an innovative distribution-free method developed by Iman and Conover (1982) that allows marginal distribution shapes to be estimated from data sets distinct from the data sets used to estimate dependency structure. Their method is designed to provide rank correlations that match assessed correlations and to translate the match into predictive probability distributions for individual assessment units and larger aggregates. (See Blondes et al. 2013a for informative examples.)
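The core of the Iman and Conover (1982) idea can be sketched in a few lines. The version below is a simplified illustration under stated assumptions, not the USGS implementation: it uses van der Waerden (normal) scores, a hypothetical 3 × 3 target correlation matrix, hypothetical marginals, and a triangular distribution as a stand-in for a Beta-PERT shape. Each column of the input sample is only reordered, so every marginal distribution is preserved exactly while the rank correlations are pushed toward the target.

```python
import numpy as np
from statistics import NormalDist


def ranks(v):
    """Ranks 0..n-1 of a 1-D array (assumes no ties)."""
    return np.argsort(np.argsort(v))


def iman_conover(x, target_corr, seed=0):
    """Reorder the columns of x so their rank correlations approximate target_corr.

    x           : (n, k) array; each column is a sample from one marginal.
    target_corr : (k, k) positive-definite target correlation matrix.
    """
    n, k = x.shape
    rng = np.random.default_rng(seed)
    nd = NormalDist()
    scores = np.array([nd.inv_cdf(i / (n + 1)) for i in range(1, n + 1)])
    # Score matrix: an independent random permutation of the scores per column.
    m = np.column_stack([rng.permutation(scores) for _ in range(k)])
    # Remove the sample correlation of m, then impose the target correlation.
    f = np.linalg.cholesky(np.corrcoef(m, rowvar=False))
    p = np.linalg.cholesky(np.asarray(target_corr))
    m_star = m @ np.linalg.inv(f).T @ p.T
    # Rearrange each column of x to share the rank order of m_star's column.
    out = np.empty_like(x)
    for j in range(k):
        out[:, j] = np.sort(x[:, j])[ranks(m_star[:, j])]
    return out


rng = np.random.default_rng(1)
n = 5_000
x = np.column_stack([
    rng.lognormal(0.0, 1.0, n),
    rng.triangular(0.0, 2.0, 10.0, n),   # stand-in for a Beta-PERT shape
    rng.lognormal(1.0, 0.5, n),
])
target = np.array([[1.0, 0.6, 0.3],
                   [0.6, 1.0, 0.5],
                   [0.3, 0.5, 1.0]])
x_dep = iman_conover(x, target)
rank_corr = np.corrcoef(np.column_stack([ranks(c) for c in x_dep.T]), rowvar=False)
print(np.round(rank_corr, 2))
print("sum P5, P95:", np.quantile(x_dep.sum(axis=1), [0.05, 0.95]))
```

The achieved rank correlations are close to, but not exactly equal to, the target, since the target is imposed on the scores rather than on the ranks themselves.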

How to aggregate from basin, to region and then to a national scale is an issue. Should this be done in a single stage using the correlation matrix for all SAUs in the study or successively aggregate subsets of SAUs in multiple stages? Blondes et al. (2013b) conclude that

Although the single-stage approach requires determination of significantly more correlation coefficients, it captures geologic dependencies among similar units in different basins and it is less sensitive to fluctuations in low correlation coefficients than the multiple stage approach. Thus, subsets of one single-stage correlation matrix are used to aggregate to basin, regional, and national scales.

Successive aggregation in multiple stages drastically reduces the number of pairwise correlations that must be elicited from geologists at the expense of requiring each assessor to appraise pairwise correlations of sums of assessment unit magnitudes. Although there are no studies comparing how well geologists' assessments calibrate when asked to appraise dependencies among sums of SAU magnitudes relative to appraisal of dependencies among individual SAUs it is reasonable to conjecture that individual SAU appraisals are much more likely to be well calibrated. Properties of single and multi-stage appraisal methods are studied in Kaufman et al. (2018).
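The elicitation burden of the two approaches can be compared with simple counting. In the sketch below the single-stage figure follows from the 192 × 192 matrix mentioned above; the two-stage grouping (12 groups of 16 units) is purely hypothetical and is not the grouping used in the USGS study.

```python
from math import comb


def single_stage_count(n_units):
    """Pairwise correlations needed for one n_units x n_units matrix."""
    return comb(n_units, 2)


def two_stage_count(group_sizes):
    """Correlations within each group plus correlations among the group sums."""
    within = sum(comb(g, 2) for g in group_sizes)
    between = comb(len(group_sizes), 2)
    return within + between


print("single stage, 192 units:", single_stage_count(192))
# Hypothetical grouping: 12 groups of 16 units each.
print("two stages, 12 x 16    :", two_stage_count([16] * 12))
```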

### *5.3.3 Copulas and Oil and Gas Resource Assessment*

Chen et al. (2012) emphasize that at an assessment micro-level, reservoir attributes such as porosity, permeability, pressure and temperature are often decisively dependent and that empirical data suggest dependencies are present among more aggregate assessment units in mature provinces—among fields in a mature play or basin for example. Their argument is that a basin's tectonic framework exerts "strong geographic control" over many geological features and leads to geographic and spatial dependencies and that because plays in a given basin share "…petroleum system elements, such as source rocks, regional top seal, migration fairways, timing, regional tectonics for trap formation, and accumulation preservation factors" a probabilistic model of pools or fields in a play in a given basin should incorporate probabilistic dependencies among these attributes as well as between plays. They are the first to use copulas in this setting.

Sklar (1959) proved that, subject to mild restrictions, a multivariate cumulative distribution can be mapped into a joint cumulative distribution of uniform random variables called a copula. As with Iman and Conover's method, adoption of a copula model allows marginal distribution shapes to be estimated from data sets distinct from those used to estimate dependency structure.

Suppose as in Sect. 5.2 above that $F_X$ is the distribution function of a random vector $\mathbf{X} = (X_1, \ldots, X_n)^t$ with domain $\mathbf{R}^n$ and marginal cumulative distributions $F_i$, $i = 1, \ldots, n$. Let $\mathbf{U}_n = (U_1, \ldots, U_n)$, with $U_i = F_i(X_i)$, be the corresponding vector of uniform (0, 1) random variables and $\mathbf{u}_n = (u_1, \ldots, u_n)$ be a realization of $\mathbf{U}_n$. The $U_i$ are mutually independent only if the $X_i$ are. Then with $u_i = F_i(x_i)$, $i = 1, \ldots, n$, $\operatorname{Prob}\{X_1 \le x_1, \ldots, X_n \le x_n\} = \operatorname{Prob}\{U_1 \le u_1, \ldots, U_n \le u_n\}$.

**Definition** $C(u_1, \ldots, u_n) = \operatorname{Prob}\{U_1 \le u_1, \ldots, U_n \le u_n\}$ is the copula of $F_X$.

Set $dF_i = f_i\,dx_i$, $i = 1, \ldots, n$ and $dC(u_1, \ldots, u_n) = c(u_1, \ldots, u_n)\,du_1 \ldots du_n$. The joint density of $\mathbf{X}$ can be written as $c(u_1, \ldots, u_n) \times f_1(x_1) \times \cdots \times f_n(x_n)$. The term $c$ in the joint density captures the dependency structure of elements of $\mathbf{X}$. Because $\operatorname{Prob}\{X_1 \le x_1, \ldots, X_n \le x_n\} = \operatorname{Prob}\{U_1 \le u_1, \ldots, U_n \le u_n\}$, a procedure for generating samples from $C$ produces samples of $\mathbf{X}$ by inversion of $u_i = F_i(x_i)$, $i = 1, \ldots, n$.

Computation requires choice of a copula functional form. Among a variety of choices Chen et al. chose the bivariate normal copula, a popular choice closely tied to standard multivariate normal distribution theory.
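Sampling from a normal copula and inverting the marginals can be sketched as follows. This is an illustrative example only: the correlation value, the two marginals (exponential and Pareto, chosen because their inverse cdfs have closed forms) and the attribute names `net_pay` and `area` are assumptions, not values from Chen et al. (2012).

```python
import math
import numpy as np

# Standard normal cdf applied elementwise (math.erf vectorized over arrays).
_phi = np.vectorize(lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0))))


def ranks(v):
    """Ranks 0..n-1 of a 1-D array (assumes no ties)."""
    return np.argsort(np.argsort(v))


def gaussian_copula_uniforms(corr, n_draws, seed=0):
    """Sample (U_1, ..., U_n) from the normal copula with correlation matrix corr."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(np.asarray(corr))
    z = rng.standard_normal((n_draws, len(corr))) @ chol.T
    return _phi(z)


# Hypothetical marginals with closed-form inverse cdfs (illustrative only).
def exponential_inv(u, mean):
    return -mean * np.log1p(-u)


def pareto_inv(u, scale, shape):
    return scale * (1.0 - u) ** (-1.0 / shape)


corr = [[1.0, 0.6], [0.6, 1.0]]
u = gaussian_copula_uniforms(corr, 50_000)
net_pay = exponential_inv(u[:, 0], mean=20.0)       # inversion of u_1 = F_1(x_1)
area = pareto_inv(u[:, 1], scale=1.0, shape=3.0)    # inversion of u_2 = F_2(x_2)

rank_corr = np.corrcoef(ranks(net_pay), ranks(area))[0, 1]
print("mean net pay:", net_pay.mean())
print("Spearman rank correlation:", rank_corr)
```

For a normal copula with parameter $r$, Spearman's rank correlation equals $(6/\pi)\arcsin(r/2)$, so the printed rank correlation should be slightly below the copula parameter regardless of which marginals are plugged in.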

Their regional resource assessment of the Canadian Arctic's Beaufort-Mackenzie Basin is based on analysis of 48 "significant" oil and gas discoveries containing 53 distinct accumulations. Empirical data are sufficiently detailed to allow study and estimation of pairwise correlations among reservoir attributes (area, porosity, oil saturation, net pay) for plays in the three major petroleum systems. The authors treat geologic risk factors as probabilistically independent because the data are not sufficient to allow empirical estimation of dependencies among them. They restrict their study of dependencies to reservoir volume attributes within each play and, through them, to the impact of probabilistic dependencies on the distribution of total resource volumes.

Four plays, Ivik, Taglu, Kugmallit (East) and Kugmallit (West), are used to illustrate how to incorporate dependencies among individual play resources. Although no systematic method for eliciting geologists' judgments about between-play dependencies is discussed, the authors motivate their choices of a rather large correlation between plays (0.6) and of perfect correlation (1.0) by noting that all four plays share the same source rock and petroleum system: "The resource richness of each play is basically a function of both the oil charge and the preservation of accumulations that are mostly controlled by common petroleum system elements… we infer that the resources in the four plays are highly correlated, although the pool size distributions among the four plays vary considerably." Pairwise correlations between area, net pay, porosity and oil saturation vary from a low of 0.20 to a high of 0.86. The authors call attention to the substantial difference between total ultimate oil resource medians under the assumption of independence and under the assumption of within- and between-play correlations: the latter is 1.6 times the former.

Principal messages are that, to be realistic, probabilistic appraisal of oil and gas resources in unexplored and partially explored regions must account for multiple sources of dependencies and that copulas are useful for doing so.

### **5.4 Concluding Remarks**

In the absence of empirical data that allows resolution of the vexing problem of how to address probabilistic dependencies among and between elements of large sets of geologic random variables we need methods that refocus and streamline expert geological judgment inputs as well as analytical methods for modeling dependencies that go beyond pairwise correlation and its cousins. One promising avenue is the theory of vines proposed by Bradford and (2002). Their theory broadens the range of allowable dependency structures beyond Bayesian belief networks and exploits properties of rank correlations in a fashion that leads to efficient computation.

### **References**



### **Chapter 6 A Statistical Analysis of the Jacobian in Retrievals of Satellite Data**

**Noel Cressie**

**Abstract** Remote sensing has become an essential component of the geosciences (the study of Earth and its system components). Remote sensing measurements are almost always energies measured in selected parts of the electro-magnetic spectrum. That is, the geophysical variable of interest is only observed indirectly; a forward model relates the energies to the variable(s) of interest and other elements of the state. The first derivative of that forward model with respect to the state is known as the Jacobian. In this chapter, we review the importance of the Jacobian to inferring the state, and we use it to diagnose which state elements may be difficult to estimate. We develop the Statistical Significance Filter and flag those state elements that consistently fail to get through the filter.

### **6.1 Introduction**

Remote sensing of the environment is a fundamentally important part of humans' quest to understand the Earth system and how the different components interact (e.g., climate, water, carbon). In the future, this knowledge may be critical to our survival. Satellite and aircraft campaigns allow a "bird's-eye view" of large parts of Earth, but not all campaigns are alike. For example, polar-orbiting satellites allow global coverage, passive instruments rely on the sun's reflected light and do not take measurements when there are clouds or when it is night, and programs such as NASA's ASCENDS will measure day or night, anywhere on the orbit track.

In this chapter, a passive instrument on a polar-orbiting satellite, namely Japan's Greenhouse Gases Observing Satellite (GOSAT), will be used as a leading example. However, the idea behind what I shall present is general and could apply to many remote sensing inversion problems involving a non-linear forward model. In such problems, the goal is to infer a hidden state from energies detected by an instrument sensitive to certain known bands of the electro-magnetic spectrum.

N. Cressie (✉)

Distinguished Professor, National Institute for Applied Statistics Research Australia (NIASRA), School of Mathematics and Applied Statistics, University of Wollongong, Wollongong, Australia

e-mail: ncressie@uow.edu.au

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_6

Section 6.2 of this chapter gives a statistical framework behind the problem of uncertainty quantification of retrieved states. Section 6.3 calls out the Jacobian matrix as an important component of the retrieval algorithm and defines a unit-free Jacobian for subsequent statistical analysis. That analysis is described in Sect. 6.4, where a Statistical Significance Filter is defined. In Sect. 6.5, this methodology is applied to a number of retrievals taken over Australia, where certain state elements are flagged as being potentially difficult to estimate. The last section, Sect. 6.6, finishes with a discussion of the results obtained.

### **6.2 A Statistical Framework for Satellite Retrievals**

The biases, variances, and mean squared prediction errors of retrievals need to be calculated in the general setting of a nonlinear forward model. The book by Rodgers (2000) has a section on error analysis, but it approaches the problem mostly from a numerical-sensitivity viewpoint. The strongly statistical viewpoint given here calculates the first two moments of a retrieval and the distribution of elements of the associated Jacobian matrix (defined below as **K**). In the case where relationships are non-linear, the well known "delta method" (based on Taylor-series expansions; e.g., Meyer 1975, Chap. 10) gives *approximate* (to leading orders) biases and mean squared prediction errors of the estimators (Cressie and Wang 2013).

The $n\_\varepsilon$-dimensional vector of radiances **Y** is related to the $n\_a$-dimensional state **X** through a non-linear forward model,

$$\mathbf{Y} = \mathbf{F}(\mathbf{X}) + \boldsymbol{\varepsilon},\tag{6.1}$$

where the state vector **X** includes volume mixing ratios of CO<sub>2</sub> at prespecified geopotential heights, the error vector $\boldsymbol{\varepsilon} \sim \text{Gau}(\mathbf{0}, \mathbf{S}\_\varepsilon)$, and **X** and $\boldsymbol{\varepsilon}$ are statistically independent. Further, there is an *a priori* assumption that

$$\mathbf{X} = \mathbf{X}\_a + \boldsymbol{\alpha},\tag{6.2}$$

where $\boldsymbol{\alpha} \sim \text{Gau}(\mathbf{0}, \mathbf{S}\_a)$. Notice that if there is consistent bias present in the retrieval, this can be accounted for by adding it to $\mathbf{X}\_a$, leaving the assumption, $\boldsymbol{\alpha} \sim \text{Gau}(\mathbf{0}, \mathbf{S}\_a)$, intact. Define the matrices,

$$\mathbf{K}(\mathbf{x}) \equiv \frac{\partial \mathbf{F}(\mathbf{x})}{\partial \mathbf{x}} \equiv \left(\frac{\partial F\_i(\mathbf{x})}{\partial x\_j} \; ; \; i = 1, \dots, n\_\varepsilon; j = 1, \dots, n\_a\right) \tag{6.3}$$

$$\mathbf{G(x)} \equiv \{ \mathbf{S}\_a^{-1} + \mathbf{K(x)}' \mathbf{S}\_\varepsilon^{-1} \mathbf{K(x)} \}^{-1} \mathbf{K(x)}' \mathbf{S}\_\varepsilon^{-1} \tag{6.4}$$

$$\mathbf{A(x)} \equiv \mathbf{G(x)} \mathbf{K(x)},\tag{6.5}$$

where **x** is any atmospheric state. (Recall that the true state is denoted as **X**.)


The $n\_\varepsilon \times n\_a$ matrix **K**(⋅) is called the *Jacobian*. Partial derivatives of **K**(⋅) represent the degree of non-linearity in the forward model. In the case of a *linear* forward model, **K** is constant, and any partial derivatives of it are zero.

An estimate of **X**, sometimes called a *retrieval*, is often obtained by choosing an $\hat{\mathbf{X}}$ that allows $\mathbf{F}(\hat{\mathbf{X}})$ to be "close to" **Y**, subject to smoothness conditions on $\hat{\mathbf{X}}$. This regularisation is usually defined as follows: Minimise

$$(\mathbf{Y} - \mathbf{F}(\mathbf{X}))^{\prime} \mathbf{S}\_{\varepsilon}^{-1} (\mathbf{Y} - \mathbf{F}(\mathbf{X})) + (\mathbf{X} - \mathbf{X}\_{a})^{\prime} \mathbf{S}\_{a}^{-1} (\mathbf{X} - \mathbf{X}\_{a}) \tag{6.6}$$

with respect to **X**, which results in the retrieval $\hat{\mathbf{X}}$.
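As an illustration of minimising (6.6), the following Python sketch applies a plain Gauss–Newton iteration to a toy problem. The forward model `forward`, its Jacobian, and all numerical values are hypothetical choices made for this example; they are not the ACOS forward model, and operational retrievals typically rely on more robust schemes.

```python
import numpy as np

def forward(x):
    """Toy non-linear forward model F: R^2 -> R^3 (hypothetical)."""
    return np.array([x[0] + 0.5 * x[1] + 0.05 * x[0] ** 2,
                     0.2 * x[0] + x[1] + 0.05 * x[1] ** 2,
                     0.7 * x[0] + 0.3 * x[1] + 0.05 * x[0] * x[1]])

def jacobian(x):
    """Analytic Jacobian K(x) of the toy forward model (3 x 2)."""
    return np.array([[1.0 + 0.1 * x[0], 0.5],
                     [0.2, 1.0 + 0.1 * x[1]],
                     [0.7 + 0.05 * x[1], 0.3 + 0.05 * x[0]]])

def cost(x, y, x_a, S_eps_inv, S_a_inv):
    """Objective (6.6): data misfit plus prior penalty."""
    r, d = y - forward(x), x - x_a
    return float(r @ S_eps_inv @ r + d @ S_a_inv @ d)

def cost_gradient(x, y, x_a, S_eps_inv, S_a_inv):
    """Gradient of (6.6) with respect to x."""
    K = jacobian(x)
    return -2.0 * K.T @ S_eps_inv @ (y - forward(x)) + 2.0 * S_a_inv @ (x - x_a)

def retrieve(y, x_a, S_eps, S_a, n_iter=20):
    """Gauss-Newton iteration for (6.6), started at the prior mean."""
    S_eps_inv, S_a_inv = np.linalg.inv(S_eps), np.linalg.inv(S_a)
    x = x_a.copy()
    for _ in range(n_iter):
        K = jacobian(x)
        lhs = S_a_inv + K.T @ S_eps_inv @ K
        rhs = K.T @ S_eps_inv @ (y - forward(x)) - S_a_inv @ (x - x_a)
        x = x + np.linalg.solve(lhs, rhs)
    return x

x_a = np.array([1.0, 2.0])                 # prior mean (assumed)
S_a = np.diag([1.0, 4.0])                  # prior covariance (assumed)
S_eps = np.diag([0.04, 0.04, 0.09])        # measurement-error covariance (assumed)
y = forward(np.array([1.2, 1.8]))          # noise-free synthetic radiances
x_hat = retrieve(y, x_a, S_eps, S_a)
```

In this toy setting, `x_hat` should lie close to the state used to generate `y`, pulled slightly towards the prior mean by the second term of (6.6).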

The $n\_a \times n\_\varepsilon$ matrix **G**(⋅) represents a type of *"gain" matrix* in the relationship between retrieval $\hat{\mathbf{X}}$ and data **Y**; that is,

$$
\hat{\mathbf{X}} = \mathbf{X}\_a + \mathbf{G}(\hat{\mathbf{X}})(\mathbf{Y} - \mathbf{F}(\mathbf{X}\_a) - \mathbf{K}(\hat{\mathbf{X}})\mathbf{X}\_a) + \text{ "remainder" }.
$$

In the linear case, **G** is constant and the "remainder" term is zero.

The $n\_a \times n\_a$ matrix **A**(⋅) yields the *averaging kernel matrix* in the relation between retrieval and true state; that is,

$$
\hat{\mathbf{X}} = \mathbf{X}\_a + \mathbf{A}(\hat{\mathbf{X}})(\mathbf{X} - \mathbf{X}\_a) + \text{``remainder''}.
$$

In the linear case, **A** is constant, the "remainder" term is $\mathbf{G}\boldsymbol{\varepsilon}$, and recall that $\boldsymbol{\varepsilon}$ is independent of **X**.

In this section, I discuss the bias vector and the mean-squared-prediction-error (MSPE) matrix of the retrieval, $\hat{\mathbf{X}}$. The bias vector is defined as:

$$E(\hat{\mathbf{X}} - \mathbf{X}) = E(\hat{\mathbf{X}}) - E(\mathbf{X}) = E(\hat{\mathbf{X}}) - \mathbf{X}\_a,$$

where recall that $\mathbf{X}\_a$ is the prior mean of the state vector **X**.

The MSPE matrix is defined as:

$$E((\hat{\mathbf{X}} - \mathbf{X})(\hat{\mathbf{X}} - \mathbf{X})') = \text{var}(\hat{\mathbf{X}} - \mathbf{X}) + (E(\hat{\mathbf{X}}) - \mathbf{X}\_a)(E(\hat{\mathbf{X}}) - \mathbf{X}\_a)',$$

where $\text{var}(\hat{\mathbf{X}} - \mathbf{X})$ is the covariance matrix of the retrieval error, $\hat{\mathbf{X}} - \mathbf{X}$. The MSPE matrix can be a more appropriate statistical measure of uncertainty than the covariance matrix of retrieval error when there is bias present. When the bias is zero, the two measures of uncertainty are the same.

When the forward model is *linear*, it is easily seen (e.g., Rodgers 2000) that the bias vector,

$$E(\hat{\mathbf{X}} - \mathbf{X}) = \mathbf{0}.\tag{6.7}$$

That is, in the linear case, $\hat{\mathbf{X}}$ is *unbiased*. Further, in the linear case, the MSPE matrix can be derived *exactly* and written in a number of equivalent ways. From Connor et al. (2008) and Cressie and Wang (2013),

$$E((\hat{\mathbf{X}} - \mathbf{X})(\hat{\mathbf{X}} - \mathbf{X})') = E(\text{var}(\mathbf{X}|\mathbf{Y})) \equiv \hat{\mathbf{S}},\tag{6.8}$$

where the MSPE matrix is given by

$$\hat{\mathbf{S}} = \{ \mathbf{S}\_a^{-1} + \mathbf{K}^\prime \mathbf{S}\_\varepsilon^{-1} \mathbf{K} \}^{-1} = (\mathbf{A} - \mathbf{I}) \mathbf{S}\_a (\mathbf{A} - \mathbf{I})^\prime + \mathbf{G} \mathbf{S}\_\varepsilon \mathbf{G}^\prime \,. \tag{6.9}$$
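The equality of the two expressions in (6.9) can be checked numerically for a linear forward model. The short Python sketch below does so for a small, hypothetical Jacobian and hypothetical covariance matrices; the gain and averaging-kernel matrices follow definitions (6.4) and (6.5).

```python
import numpy as np

# Hypothetical constant Jacobian (3 radiances, 2 state elements) and covariances.
K = np.array([[1.0, 0.5], [0.2, 1.0], [0.7, 0.3]])
S_eps = np.diag([0.04, 0.04, 0.09])
S_a = np.array([[1.0, 0.3], [0.3, 4.0]])

S_eps_inv = np.linalg.inv(S_eps)

# First form in (6.9): {S_a^-1 + K' S_eps^-1 K}^-1.
S_hat = np.linalg.inv(np.linalg.inv(S_a) + K.T @ S_eps_inv @ K)

G = S_hat @ K.T @ S_eps_inv      # gain matrix, (6.4)
A = G @ K                        # averaging kernel matrix, (6.5)
I = np.eye(K.shape[1])

# Second form in (6.9): (A - I) S_a (A - I)' + G S_eps G'.
S_hat_alt = (A - I) @ S_a @ (A - I).T + G @ S_eps @ G.T

max_abs_difference = np.max(np.abs(S_hat - S_hat_alt))
```

The first term of the second form reflects smoothing by the prior and the second reflects measurement error; their sum agrees with the first form up to rounding.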

When the forward model is *nonlinear*, the bias of $\hat{\mathbf{X}}$ is *nonzero*, and the equalities in (6.9) are no longer true. However, from the "delta method," Cressie et al. (2016) show that (6.7) and (6.9) hold, *to leading order*. In what follows, a leading-order analysis is carried out. This amounts to assuming the forward model to be locally linear, which is a weaker assumption than assuming global linearity, namely $\mathbf{Y} = \mathbf{c} + \mathbf{K}\mathbf{X} + \boldsymbol{\varepsilon}$, across the whole state space defined by all possible values of **X**.

The locally linear forward model is derived using a Taylor-series expansion:

$$\begin{split} \mathbf{Y} &= \mathbf{F}(\mathbf{X}) + \boldsymbol{\varepsilon} \\ &= \mathbf{F}(\mathbf{X}\_{0}) + \left. \frac{\partial \mathbf{F}(\mathbf{x})}{\partial \mathbf{x}} \right|\_{\mathbf{x} = \mathbf{X}\_{0}} \times (\mathbf{X} - \mathbf{X}\_{0}) + \boldsymbol{\lambda}, \\ &\equiv \mathbf{c}(\mathbf{X}\_{0}) + \mathbf{K}(\mathbf{X}\_{0})\mathbf{X} + \boldsymbol{\lambda}, \end{split}$$

where $\boldsymbol{\lambda}$ models the lack of fit of the local linear model (about the linearisation point $\mathbf{x} = \mathbf{X}\_0$) to **F**(**X**). The linearisation point $\mathbf{X}\_0$ is often chosen to be the prior mean $\mathbf{X}\_a$, but I want to emphasise here that it need not be.
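The local linearisation can be sketched in Python as follows. The forward model `forward` is a hypothetical toy function, and the Jacobian is approximated by central finite differences; this is only one of several ways a Jacobian might be obtained and is not claimed to be how any particular retrieval algorithm computes it.

```python
import numpy as np

def forward(x):
    """Toy non-linear forward model F: R^2 -> R^3 (hypothetical)."""
    return np.array([np.exp(-0.5 * x[0]) + x[1],
                     x[0] * x[1],
                     np.sin(x[0]) + x[1] ** 2])

def jacobian_fd(F, x0, h=1e-6):
    """Central finite-difference approximation to K(x0) = dF/dx at x0."""
    x0 = np.asarray(x0, dtype=float)
    f0 = F(x0)
    K = np.empty((f0.size, x0.size))
    for j in range(x0.size):
        step = np.zeros_like(x0)
        step[j] = h
        K[:, j] = (F(x0 + step) - F(x0 - step)) / (2.0 * h)
    return K

x0 = np.array([1.0, 2.0])            # linearisation point X_0 (assumed)
K0 = jacobian_fd(forward, x0)        # K(X_0)
c0 = forward(x0) - K0 @ x0           # intercept c(X_0)

# Compare the local linear model with the forward model near X_0.
x_near = x0 + np.array([0.01, -0.02])
linear_prediction = c0 + K0 @ x_near
lack_of_fit = forward(x_near) - linear_prediction
```

Close to the linearisation point, `lack_of_fit` is of second order in the distance from `x0`; it grows as the state moves away from `x0`.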

### **6.3 The Jacobian Matrix and its Unit-Free Version**

The Jacobian matrix is the first derivative of the $n\_\varepsilon$-dimensional forward function vector, **F**(**x**), with respect to the $n\_a$-dimensional state **x**. From the definition given in (6.3), it is an $n\_\varepsilon \times n\_a$ matrix. Write the matrix as $(K\_{ij})$, and note that the units of $K\_{ij}$ are radiance (energy) per unit of state-space element *j*.

Define the vectors,

$$\begin{aligned} (\sigma\_{\varepsilon,1}^2, \dots, \sigma\_{\varepsilon,n\_{\varepsilon}}^2)' &\equiv \text{diag}(\mathbf{S}\_{\varepsilon}), \\ (\sigma\_{a,1}^2, \dots, \sigma\_{a,n\_a}^2)' &\equiv \text{diag}(\mathbf{S}\_a), \end{aligned}$$

where diag(⋅) is a matrix operator that extracts a vector made up of the matrix's diagonal elements. Then the *unit-free Jacobian* is defined as follows:

$$\phi\_{ij} \equiv K\_{ij}\,\sigma\_{a,j}/\sigma\_{\varepsilon,i}\,;\quad i=1,\ldots,n\_\varepsilon,\; j=1,\ldots,n\_a.\tag{6.10}$$

During the retrieval, the most difficult and time-consuming part is to minimise (6.6); for example, using a Levenberg-Marquardt algorithm requires evaluation of the Jacobian matrix at each iteration of the minimisation. Let $\hat{K}\_{ij}$ be a generic Jacobian element used during the retrieval. Then define the corresponding unit-free version as,

$$\hat{\phi}\_{ij} \equiv \hat{K}\_{ij}\,\sigma\_{a,j}/\sigma\_{\varepsilon,i}\,,\tag{6.11}$$

and denote $(\hat{\phi}\_{ij})$ as the $n\_\varepsilon \times n\_a$ *unit-free Jacobian matrix*.
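A minimal Python sketch of (6.10) and (6.11) follows. The function name `unit_free_jacobian` and the numerical values are hypothetical. The example also illustrates why the scaled entries are unit-free: re-expressing a state element in different units rescales its Jacobian column and its prior standard deviation in opposite directions, leaving the scaled entries unchanged.

```python
import numpy as np

def unit_free_jacobian(K, S_eps, S_a):
    """Scale K_ij by sigma_{a,j} / sigma_{eps,i}, as in (6.10)."""
    sigma_eps = np.sqrt(np.diag(S_eps))   # length n_eps
    sigma_a = np.sqrt(np.diag(S_a))       # length n_a
    return K * sigma_a[np.newaxis, :] / sigma_eps[:, np.newaxis]

# Hypothetical Jacobian with 3 radiances and 2 state elements.
K = np.array([[2.0, 0.1], [4.0, 0.2], [6.0, 0.3]])
S_eps = np.diag([4.0, 16.0, 36.0])        # sigma_eps = (2, 4, 6)
S_a = np.diag([1.0, 100.0])               # sigma_a = (1, 10)
phi = unit_free_jacobian(K, S_eps, S_a)

# Change the units of state element 2 by a factor s: x' = s * x.
s = 1000.0
K_rescaled = K.copy()
K_rescaled[:, 1] /= s                     # derivative with respect to x'
S_a_rescaled = S_a.copy()
S_a_rescaled[1, 1] *= s ** 2              # prior variance of x'
phi_rescaled = unit_free_jacobian(K_rescaled, S_eps, S_a_rescaled)
```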

For satellite retrievals, the data vector **Y** can often be partitioned as

$$\mathbf{Y} = (\mathbf{Y}'\_1, \dots, \mathbf{Y}'\_K)',$$

where

$$\mathbf{Y}\_k \equiv (Y\_i \; :\; i \in \text{band}\_k)',\tag{6.12}$$

and $\text{band}\_1, \ldots, \text{band}\_K$ are mutually exclusive index sets that represent a grouping of radiances according to which bands of the electro-magnetic spectrum they belong. For example, Japan's GOSAT and NASA's Orbiting Carbon Observatory-2 (OCO-2) instruments have *K* = 3 bands, corresponding to the oxygen A band (OA), the weak carbon dioxide band (WC), and the strong carbon dioxide band (SC); our analysis in Sect. 6.5 uses data from GOSAT's three bands. Another example is from NASA's Atmospheric Infrared Sounder (AIRS) instrument flying on the Aqua satellite, which has *K* = 4 bands, corresponding to four geophysical variables, namely temperature, water vapour, ozone, and carbon dioxide.

In what follows, we abbreviate "$\text{band}\_k$" to "$b\_k$." Because the unit-free Jacobian has elements that are potentially comparable, we can partition it and analyse it in comparable ways. Recall that the index *j* corresponds to a given element of the state vector, for example, a water-vapour scale factor or a near-surface carbon-dioxide volume mixing ratio. Then fix the state element *j*, and consider the behaviour of the *j*th column as row *i* varies within individual bands. That is, for a fixed *j*, consider

$$\{\hat{\phi}\_{ij} \, : \, i \in b\_k\} \tag{6.13}$$

to be a random sample from a distribution indexed by *k*, for bands *k* = 1, …, *K*.

Consequently, instead of thinking about $n\_\varepsilon \cdot n\_a$ entries in the Jacobian, attention turns to $n\_a \cdot K$ *distributions*. For example, for the retrievals from GOSAT data that are being considered here, $n\_\varepsilon = 2240$, $n\_a = 112$, and *K* = 3. Hence, the pair (*j*, *k*) indexes one of 112 ⋅ 3 = 336 possible distributions, whose mean, $\mu\_{jk}$, is of primary interest. For *j* a fixed element of the state vector, if $\mu\_{j1} = \mu\_{j2} = \cdots = \mu\_{jK} = 0$, then that element is poorly determined by the data alone; see Sect. 6.4. This is a flag that says the (prior) mean and precision of the *j*th state element need to be specified very carefully in the second term of (6.6) in order to obtain an acceptably precise retrieval $\hat{X}\_j$.

### **6.4 Statistical Significance Filter**

To leading order, the forward model (6.1) can be written as,

$$\mathbf{Y} = \mathbf{c} + \mathbf{K}\_1 X\_1 + \dots + \mathbf{K}\_{n\_a} X\_{n\_a} + \varepsilon,\tag{6.14}$$

which is a multiple-regression model with known, typically different, intercepts given by the elements of **c**; known covariates $\mathbf{K}\_1, \ldots, \mathbf{K}\_{n\_a}$ (the $n\_a$ columns of **K**); and unknown regression coefficients $X\_1, \ldots, X\_{n\_a}$. Clearly, if $\mathbf{K}\_j$ is zero, then $X\_j$ will not be estimable. Further, if for a given *j*, $\{|K\_{ij}| : i = 1, \ldots, n\_\varepsilon\}$ are uniformly "small," then the uncertainty associated with the estimate of $X\_j$ will be large.
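The effect of a uniformly small column can be illustrated with the linear-case MSPE matrix in (6.9). In the Python sketch below, all matrices are hypothetical and the prior covariance is taken to be diagonal for simplicity; the second column of the Jacobian is shrunk towards zero and the resulting uncertainty of the second state element is recorded.

```python
import numpy as np

def posterior_variances(K, S_eps, S_a):
    """Diagonal of {S_a^-1 + K' S_eps^-1 K}^-1, the linear-case MSPE matrix (6.9)."""
    S_hat = np.linalg.inv(np.linalg.inv(S_a) + K.T @ np.linalg.inv(S_eps) @ K)
    return np.diag(S_hat)

K_full = np.array([[1.0, 0.5], [0.2, 1.0], [0.7, 0.3]])   # hypothetical Jacobian
S_eps = np.diag([0.04, 0.04, 0.09])
S_a = np.diag([1.0, 4.0])                                 # prior variance of X_2 is 4

scales = [1.0, 0.1, 0.01, 0.0]   # multipliers applied to the second column
var_x2 = []
for t in scales:
    K = K_full.copy()
    K[:, 1] *= t
    var_x2.append(posterior_variances(K, S_eps, S_a)[1])
```

As the column shrinks, the variance for the second element increases towards its prior variance; with a zero column the data carry no information about that element and the prior alone determines its uncertainty.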

In the previous section, we noted that for remote sensing retrievals, the $n\_\varepsilon$ elements in **Y** can be partitioned into *K* bands, $\mathbf{Y}\_1, \ldots, \mathbf{Y}\_K$. Then write (6.14) equivalently as *K* equations. In obvious notation that respects the partitioning,

$$\mathbf{Y}\_k = \mathbf{c}\_k + \mathbf{K}\_{1k}X\_1 + \dots + \mathbf{K}\_{n\_ak}X\_{n\_a} + \boldsymbol{\varepsilon}\_k; \, k = 1, \dots, K,\tag{6.15}$$

where $\{\mathbf{K}\_{jk} : j = 1, \ldots, n\_a\}$ are the $n\_a$ vectors corresponding to the *k*th band.

Clearly, if $\mathbf{K}\_{jk} = \mathbf{0}$, then its unit-free version is also **0**. Hence, the problem of whether $X\_j$ is poorly determined in the forward model (6.1) can be addressed in a statistical manner by considering the retrieval's unit-free Jacobian entries $\{\hat{\phi}\_{ij} : i = 1, \ldots, n\_\varepsilon\}$ as *K* arrays of random variables, $\{\hat{\phi}\_{ij} : i \in b\_k\}$, for *k* = 1, …, *K*. If, for a fixed *j*, the means $\mu\_{j1}, \ldots, \mu\_{jK}$ of these *K* arrays are all zero, then $X\_j$ will be difficult to estimate.

## *6.4.1 Hypothesis Tests*

Consider (6.13) and make the following assumption: For a given retrieval, a given state element *j*, and a given band *k*,

$$\{\hat{\phi}\_{ij} \,:\, i \in b\_k\} \overset{iid}{\sim} \operatorname{Dist}(\mu\_{jk}),$$

where "iid" denotes "independent and identically distributed," and "Dist($\mu$)" denotes a probability distribution with mean $\mu$. For this retrieval, the idea is to flag those state elements and bands for which the null hypothesis, $H\_{0,jk} : \mu\_{jk} = 0$, is not rejected. In particular, failure to reject the composite hypothesis,

$$H\_{0,j} \; : \; \mu\_{j1} = \mu\_{j2} = \dots = \mu\_{jK} = 0 \; , \tag{6.16}$$

implies that the *j*th state element will be difficult to estimate in the given retrieval.

Since the elements of $\{\hat{\phi}\_{ij} : i \in b\_k\}$ are considered to be a sample from a distribution with mean $\mu\_{jk}$, I shall construct a test statistic from these unit-free Jacobian values. A considerable amount of exploratory data analysis showed the common distributional assumption within the partitioned arrays to be largely correct, with occasional gross outliers that would challenge many statistical testing procedures. Those were controlled by transforming each $\hat{\phi}\_{ij}$ to $|\hat{\phi}\_{ij}|^{1/2}$, and the robust test statistic,

$$\tilde{\phi}\_{jk} \equiv \text{med}\{ |\hat{\phi}\_{ij}|^{1/2} \; ; \; i \in b\_k \} \;, \tag{6.17}$$

was used to test $H\_{0,jk} : \mu\_{jk} = 0$. The composite hypothesis test $\{H\_{0,j} : j = 1, \ldots, n\_a\}$, where $H\_{0,j}$ is given by (6.16), is then carried out using a Bonferroni adjustment (Sect. 6.4.3).

## *6.4.2 Distribution Theory for the Robust Test Statistic*

Consider generic iid random variables $W\_1, \ldots, W\_m$ distributed according to a Gaussian distribution with mean $\mu\_W$ and variance $\sigma\_W^2$, which is written as $\text{Gau}(\mu\_W, \sigma\_W^2)$. To test

$$H\_0: \,\,\mu\_W = 0 \,\,\text{versus} \,\, H\_1: \,\,\mu\_W \neq 0 \,\,,\tag{6.18}$$

consider the robust test statistic,

$$\tilde{X} \equiv \text{med}\{ |W\_i|^{1/2} \; ; \; i = 1, \ldots, m \}. \tag{6.19}$$

I now obtain distribution theory for $\tilde{X}$ under the null hypothesis in order to carry out a significance test.

If $Y \sim \text{Gau}(0, 1)$, then $E(|Y|^{1/2}) = 0.82216$ and $\text{var}(|Y|^{1/2}) = 0.12192$, which was derived by Cressie and Hawkins (1980). Then under $H\_0 : \mu\_W = 0$, $|W\_i|^{1/2} \overset{\cdot}{\sim} \text{Gau}(0.82216 \cdot \sigma\_W^{1/2}, 0.12192 \cdot \sigma\_W)$, where "$\overset{\cdot}{\sim}$" denotes "is approximately distributed as," and the approximation is established by Cressie and Hawkins (1980). Now the distribution of the median $\tilde{X}$ from a random sample $X\_1, \ldots, X\_m$ of Gaussian random variables can be approximated as Gaussian with mean $E(\tilde{X}) = E(X\_1)$, and variance $\text{var}(\tilde{X}) = \pi \, \text{var}(X\_1)/(2m)$. If all these results are combined, then under the null hypothesis $H\_0$ in (6.18),

$$\tilde{X} \overset{\cdot}{\sim} \text{Gau}(0.82216 \cdot \sigma\_W^{1/2}, 0.12192 \cdot \pi \sigma\_W / (2m)).$$
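The two constants can be checked against the closed-form absolute moments of a standard Gaussian random variable, $E|Y|^p = 2^{p/2}\,\Gamma((p+1)/2)/\sqrt{\pi}$, which is a standard result. The Python sketch below evaluates them and assembles the approximate null mean and variance of the median; the function names are hypothetical.

```python
import math

def abs_normal_moment(p):
    """E|Y|^p for Y ~ Gau(0, 1): 2^(p/2) * Gamma((p + 1)/2) / sqrt(pi)."""
    return 2.0 ** (p / 2.0) * math.gamma((p + 1.0) / 2.0) / math.sqrt(math.pi)

# Mean and variance of |Y|^(1/2) for a standard Gaussian Y.
mean_sqrt_abs = abs_normal_moment(0.5)
var_sqrt_abs = abs_normal_moment(1.0) - mean_sqrt_abs ** 2

def null_moments(sigma_w, m):
    """Approximate null mean and variance of the median of |W_i|^(1/2)."""
    mean = mean_sqrt_abs * math.sqrt(sigma_w)
    var = math.pi * var_sqrt_abs * sigma_w / (2.0 * m)
    return mean, var

null_mean, null_var = null_moments(sigma_w=1.0, m=100)
```

The computed values should agree with the quoted constants 0.82216 and 0.12192 to roughly four decimal places.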

Clearly, the alternative hypothesis $H\_1$ in (6.18) is accepted if the test statistic $\tilde{X}$ is large. At significance level $\alpha$, $H\_1$ is accepted if

$$\tilde{X} > 0.82216 \cdot \sigma\_W^{1/2} + \Phi^{-1}(1 - \alpha)(0.12192 \cdot \pi \sigma\_W / (2m))^{1/2},\tag{6.20}$$

where $\Phi^{-1}(\cdot)$ is the inverse cumulative distribution function of a Gau(0, 1) random variable. In practice, an estimate of $\sigma\_W$ will be needed.

Continuing with the same approach as above, an asymptotically unbiased, robust estimator of $\sigma\_W$ is used. Now, $\sigma\_W = \text{var}(|W\_i|^{1/2})/0.12192$, and hence $\text{var}(|W\_i|^{1/2})$ can be estimated using the median absolute deviation (MAD):

$$\text{MAD} \equiv \text{med}\{ \, \big| |W\_i|^{1/2} - \tilde{X} \big| \, : \, i = 1, \dots, m \}.$$

Then an asymptotically unbiased estimator of $\text{var}(|W\_i|^{1/2})$ is

$$\widehat{\text{var}}(|W\_i|^{1/2}) = (1.4826 \cdot \text{MAD})^2,$$

from which the estimator

$$
\tilde{\sigma}\_W \equiv (1.4826 \cdot \text{MAD})^2 / 0.12192 \tag{6.21}
$$

is obtained and substituted into (6.20).
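The test defined by (6.19), (6.20), and (6.21) can be sketched in a few lines of Python. The function name `robust_zero_mean_test` and the two synthetic samples are hypothetical; the samples are built deterministically from Gaussian quantiles so that the example is reproducible, and the significance level is an illustrative choice.

```python
import math
from statistics import NormalDist, median

C_MEAN, C_VAR = 0.82216, 0.12192   # constants quoted from Cressie and Hawkins (1980)

def robust_zero_mean_test(w, alpha):
    """Return (test statistic, critical value, reject H0?) for H0: mu_W = 0."""
    t = [abs(v) ** 0.5 for v in w]                 # square-root scale
    m = len(t)
    x_tilde = median(t)                            # test statistic (6.19)
    mad = median(abs(v - x_tilde) for v in t)      # median absolute deviation
    sigma_w = (1.4826 * mad) ** 2 / C_VAR          # robust estimator (6.21)
    z = NormalDist().inv_cdf(1.0 - alpha)
    critical = (C_MEAN * math.sqrt(sigma_w)
                + z * math.sqrt(C_VAR * math.pi * sigma_w / (2.0 * m)))  # (6.20)
    return x_tilde, critical, x_tilde > critical

m = 200
quantiles = [NormalDist().inv_cdf((i + 0.5) / m) for i in range(m)]
null_sample = quantiles                            # centred on zero
shifted_sample = [q + 10.0 for q in quantiles]     # mean far from zero
alpha = 0.01 / (112 * 3)                           # illustrative Bonferroni-type level

null_result = robust_zero_mean_test(null_sample, alpha)
shifted_result = robust_zero_mean_test(shifted_sample, alpha)
```

For these samples, the zero-centred sample should not lead to rejection, whereas the shifted sample should.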

My approach to constructing this robust statistic to test whether a mean is zero, using data that may contain large, unpredictable outliers, is somewhat unusual, but it is statistically advantageous. First, the data $\{W\_1, \ldots, W\_m\}$ are made resistant by transforming to the square-root scale where variability is dampened. Then the transformed data $\{|W\_1|^{1/2}, \ldots, |W\_m|^{1/2}\}$ are used to define a robust test statistic, given here by the median; see (6.19). Finally, the null distribution is derived, resulting in a critical region given by (6.20) with the robust estimator (6.21) substituted in. In the next subsection, the distribution theory derived in this subsection is used in the context of multiple hypothesis testing, resulting in the *Statistical Significance Filter*.

### *6.4.3 Multiple Hypothesis Tests Define the Statistical Significance Filter*

The elements of the unit-free Jacobian are considered as replicates within bands, which results in $n\_a$ (number of state elements) times *K* (number of bands) hypothesis tests of $\{H\_{0,jk} : \mu\_{jk} = 0,\ \text{for } j = 1, \ldots, n\_a \text{ and } k = 1, \ldots, K\}$. To test $H\_{0,j}$ given by (6.16), jointly for $j = 1, \ldots, n\_a$, I use a family-wise error rate of 1% and conservative Bonferroni adjustments to obtain a level of significance, $\alpha = 0.01/(n\_a \cdot K)$, that is used in each individual hypothesis test of the null hypotheses, $\{H\_{0,jk}\}$.

The *Statistical Significance Filter* only allows estimates $\{\tilde{\phi}\_{jk}\}$ to get through the filter if the corresponding $\{H\_{0,jk}\}$ are rejected. A given state element, *j* say, is flagged as problematic in a given retrieval if, simultaneously, the hypotheses $H\_{0,j1}, \ldots, H\_{0,jK}$ are not rejected. If it consistently happens that under similar (or different) geophysical conditions, the *j*th element's bands fail to get through the Statistical Significance Filter, that element $X\_j$ is flagged as being weakly sensitive to the radiance measurements **Y**. Hence, estimation of $X\_j$ would be difficult if a very disperse prior distribution in (6.2) were chosen for it.
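A compact Python sketch of the filter, applied to a whole unit-free Jacobian matrix, is given below. The function `significance_filter`, the band layout, and the synthetic matrix are hypothetical; the matrix is constructed deterministically so that two state elements have entries centred on zero in every band and two do not.

```python
import numpy as np
from statistics import NormalDist

C_MEAN, C_VAR = 0.82216, 0.12192   # constants quoted in Sect. 6.4.2

def significance_filter(phi_hat, bands, fwer=0.01):
    """Boolean (n_a, K) array: True where H0_jk is rejected (passes the filter)."""
    n_a, n_bands = phi_hat.shape[1], len(bands)
    alpha = fwer / (n_a * n_bands)                 # Bonferroni-adjusted level
    z = NormalDist().inv_cdf(1.0 - alpha)
    passed = np.zeros((n_a, n_bands), dtype=bool)
    for k, rows in enumerate(bands):
        t = np.sqrt(np.abs(phi_hat[rows, :]))      # |phi|^(1/2), shape (m, n_a)
        m = t.shape[0]
        med = np.median(t, axis=0)                 # test statistics (6.17)
        mad = np.median(np.abs(t - med), axis=0)
        sigma = (1.4826 * mad) ** 2 / C_VAR        # robust estimator (6.21)
        crit = C_MEAN * np.sqrt(sigma) + z * np.sqrt(C_VAR * np.pi * sigma / (2 * m))
        passed[:, k] = med > crit                  # critical region (6.20)
    return passed

# Synthetic example: 3 bands of 100 radiances each, 4 state elements.
m, n_bands, n_a = 100, 3, 4
q = np.array([NormalDist().inv_cdf((i + 0.5) / m) for i in range(m)])
phi_hat = np.tile(q[:, None], (n_bands, n_a))      # zero-centred entries everywhere
bands = [np.arange(k * m, (k + 1) * m) for k in range(n_bands)]
phi_hat[bands[2], 1] += 8.0                        # element 1: signal in third band only
phi_hat[:, 2] += 8.0                               # element 2: signal in all bands

passed = significance_filter(phi_hat, bands)
flagged = ~passed.any(axis=1)                      # no band gets through the filter
```

Here `flagged` marks the state elements for which none of the *K* band-wise null hypotheses is rejected, which corresponds to the flagging rule described above.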

In the next section, I apply the Statistical Significance Filter to 30 retrievals from Japan's GOSAT instrument that measures atmospheric carbon dioxide, here over central Australia.

### **6.5 ACOS Retrievals of the Atmospheric State from Japan's GOSAT Satellite**

Shown in Fig. 6.1 are 30 locations of retrievals from Japan's GOSAT satellite, where the ACOS (Atmospheric CO<sub>2</sub> Observations from Space) retrieval algorithm was used. Specifically, ACOS Version B2.8 was used here, for which $n\_a = 112$ state elements were retrieved from $n\_\varepsilon = 2240$ radiances spread roughly equally between the *K* = 3 bands, namely the OA band, the WC band, and the SC band; see Sect. 6.3. The soundings are over an arid part of Australia with uniformly high albedo, during the period from 5 June 2009 to 26 July 2009 (Source: CIRA, Colorado State University). The methodology and inference are illustrated on the retrieval at one of those locations, hereafter referred to as Location 1. Results from the other 29 retrievals are summarised at the end of this section.

A number of the state elements in B2.8 are functions of geopotential height, here labelled as 1 (top of atmosphere) down to 20 (surface of Earth). Figure 6.2 shows unit-free ice-cloud Jacobian values in a column of the atmosphere for Location 1; only those values that got through the Statistical Significance Filter are shown. It can be seen that for the ice-cloud variable, Jacobian values in the OA band are not statistically significant at higher altitudes in the atmospheric column, and hence they are potentially difficult to estimate. Figure 6.3 shows that the Statistical Significance Filter applied to water vapour (H2O) in the column results in a similar set of plots. Contrast these to Fig. 6.4, which is for the all-important carbon-dioxide (CO2) variable; only values in the SC band get through the Statistical Significance Filter.

The analysis of the retrieval for Location 1 yields non-significant Jacobian entries (i.e., forward-model derivatives near zero) *in all three bands* for the following state elements:


This behaviour is visualised in Fig. 6.5; there, a light (green) stripe in a given band for a given state element indicates that the corresponding mean is not significantly different from zero. A light stripe in every band for the given state element indicates that extra care will be needed when specifying a prior for that element. Each of the 11 elements listed above has a light stripe in every band.

The analysis was carried out on all 30 retrievals, and eight elements of the 112 dimensional state vector emerged as always having non-significant Jacobian values in all three bands for all 30 retrievals. They were:

**Fig. 6.1** Locations of 30 retrievals from GOSAT using the ACOS Version B2.8 retrieval: 5 June 2009–26 July 2009

**Fig. 6.2** Unit-free Jacobian ice-cloud values that pass through the statistical significance filter in the OA, WC, and SC bands. Values that did not pass through the filter are not plotted. Location 1 (out of 30 locations)

**Fig. 6.3** Unit-free Jacobian H2O values that pass through the statistical significance filter in the OA, WC, and SC bands. Values that did not pass through the filter are not plotted. Location 1 (out of 30 locations)

**Fig. 6.4** Unit-free Jacobian CO<sub>2</sub> values that pass through the statistical significance filter in the OA, WC, and SC bands. Values that did not pass through the filter are not plotted. Location 1 (out of 30 locations)

**Fig. 6.5** A graphic showing which of the 112 elements of the state vector (horizontal axis) pass through the statistical significance filter (dark, red colour) and which do not (light, green colour), for "band" = OA, WC, and SC. Location 1 (out of 30 locations)


The results indicate a lack of sensitivity of these eight elements in the forward equation **F** given in (6.1), for the dry, bright, flat-terrain conditions found over central Australia. Different land surfaces and atmospheric states would almost certainly result in different elements being identified.

### **6.6 Discussion**

The Jacobian matrix **K** is the first derivative of a vector-valued function **F**(**x**) of a state vector **x**. Consistently small elements in the *j*th column of **K** indicate that the *j*th element will be difficult to estimate (predict) based on data, **Y**, alone.

If prior information, as well as the data, is used to predict the state vector, this research indicates that acceptable precision for estimating this *j*th element may require the prior variance to be tightly constrained. For example, the element that is the H<sub>2</sub>O scale factor is tightly constrained physically in the prior. Thus, a retrieval of that element may cause no problem, even though its column in **K** fails to get through the Statistical Significance Filter. Regarding the 20 CO<sub>2</sub> elements that make up the CO<sub>2</sub> profile in the atmospheric column, the retrievals analysed here show the importance of the strong CO<sub>2</sub> band (SC) to its estimation. The best result would be if all 20 ⋅ 3 = 60 null hypotheses were rejected; at Location 1, only 17, all in the SC band, were rejected (Fig. 6.4).

Current versions of ACOS-like retrievals have between 40 and 50 state elements. The research presented here, on the statistical properties of the Jacobian, would allow a comparison of different versions through the behaviour of their unit-free Jacobian values. Common to all of these versions are 20 CO<sub>2</sub> elements, and the respective estimates of the means in each of the three bands (OA, WC, SC) can be compared across versions.

**Acknowledgements** This research was supported by NASA grant NNH11-ZDA001N-OCO2 and a 2015–2017 Australian Research Council Discovery Project, number DP150104576. My thanks go to Rui Wang for his early input into the research and to Ben Maloney for his careful and timely assistance with preparation of the manuscript.

### **References**


Meyer SL (1975) Data analysis for scientists and engineers. Wiley, New York

Rodgers CD (2000) Inverse methods for atmospheric sounding. World Scientific Publishing, Singapore


### **Chapter 7 All Realizations All the Time**

**Clayton V. Deutsch**

**Abstract** Geostatistical simulation of mineral deposits is becoming commonplace. The methodology and software are well established and professionals have access to the training and checking steps required for reliable application. Managing multiple realizations, however, remains daunting and unclear for many: (1) the non-uniqueness of multiple realizations is disturbing; (2) many calculations including mine planning algorithms are aimed at a single block model; and (3) there are concerns of excessive computational requirements. The correct approach to managing multiple realizations is reviewed: consider all realizations all the time and base decisions on the appropriate expected value. The principles of simulation and decision making are reviewed for resource management.

### **7.1 Introduction**

In the context of modern geostatistics, Monte Carlo Simulation (MCS) or simply simulation can be summarized by (1) the formulation of a problem with input variables, a transfer function and response variables, (2) the simulation of realizations of the input variables, (3) the application of the transfer function to compute the response variables of interest, and (4) the assembly of the simulated response variables into a probability distribution. The distribution of response variables can be used to understand uncertainty and, perhaps, for decision making.

The input variables could be the rock type and grade on a suitable grid, the transfer function could be the calculation of resources and the response variables could be the resources or reserves expressed as tonnages, grade and quantity of metal. A comprehensive simulation study could expand the input variables to include modeling parameters, price, costs and other economic and engineering parameters. The transfer function could be a model of the entire mine planning and economic forecasting process. The response variables could be key performance indicators such as net present value. The probability distributions of the input variables must be established prior to simulation, typically by a mathematical model such as the multivariate Gaussian model. The transfer function must be known to process realizations of the input variables to response variables of interest.

C. V. Deutsch (✉)
University of Alberta, Edmonton, Canada
e-mail: cdeutsch@ualberta.ca

© The Author(s) 2018. B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_7

The key operation in simulation is the drawing of realizations from a specified probability distribution. This is done in a fair manner for unbiased results. Pseudorandom number generators generate numbers that have properties very close to random numbers, but are indexed to a seed. These numbers are uniform between 0 and 1, yet our input distributions are rarely uniform, so the corresponding quantile is drawn from the distribution we are simulating from: *z* = *F*<sup>−1</sup>(*r*), where *F*(·) is the cumulative distribution, *z* is the simulated value and *r* is the random number.
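The quantile draw described above can be sketched in a few lines. This is a minimal illustration for a discrete distribution, not the method of any particular geostatistical package; the function name `draw_from_cdf`, the three-class grade distribution and the seed are all assumptions made for the example.

```python
import bisect
import random

def draw_from_cdf(values, cum_probs, r):
    """Return the quantile z = F^{-1}(r) of a discrete distribution.

    values    -- outcomes sorted in increasing order
    cum_probs -- cumulative probabilities F(values[i]), ending at 1.0
    r         -- a uniform random number in [0, 1)
    """
    # First index whose cumulative probability is >= r.
    index = bisect.bisect_left(cum_probs, r)
    return values[index]

# Hypothetical three-class grade distribution (illustrative numbers only).
grades = [0.5, 1.0, 2.0]
cum_probs = [0.6, 0.9, 1.0]

rng = random.Random(7201)   # the seed indexes the pseudorandom stream
sample = [draw_from_cdf(grades, cum_probs, rng.random()) for _ in range(5)]
print(sample)
```

Re-running with the same seed reproduces the same realizations, which is what "indexed to a seed" means in practice.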

Consider a simple example of three dice. The input variables are the three numbers showing on the faces of three fair cubic dice. The cumulative distribution of each input variable has six equal steps. The transfer function is the summation operator. The response is the sum. As illustrated on Fig. 7.1, one realization is generated by three random numbers, e.g., 0.69, 0.062 and 0.78 leading to a realization of 5, 1 and 5. Simulation is repeated for multiple realizations. The response distribution shown is the result for 100 realizations. There are many points that could be reinforced from this small example. The distribution of the input variables must be known prior to simulation; simulation is primarily transferring input variable uncertainty through a transfer function to response variable uncertainty. The space of uncertainty in this example is only 6<sup>3</sup> = 216, which is a very small number, but the space of uncertainty is practically infinite in geological modeling where there are many variables at many locations. Categorical variables require an arbitrary ordering. Finally, it would be wrong to focus on one realization; in this example we should not conclude that the first and third dice are likely high numbers and the second die is a low number. We only understand the result of simulation by considering an ensemble of realizations. This is a critical point.
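The dice example can be reproduced with a short script. The sketch below maps each random number to a face through the six-step cumulative distribution and sums the faces; the seed and the helper names `die_face` and `simulate_sum` are assumptions, so the ensemble will differ from the one hundred realizations behind Fig. 7.1.

```python
import math
import random
from collections import Counter

def die_face(r):
    """Map a uniform number r in [0, 1] to a die face via the six-step CDF."""
    return max(1, math.ceil(6 * r))

def simulate_sum(rng, n_dice=3):
    """One realization: draw each die, then apply the transfer function (sum)."""
    return sum(die_face(rng.random()) for _ in range(n_dice))

# The realization quoted in the text: 0.69, 0.062 and 0.78.
faces = [die_face(r) for r in (0.69, 0.062, 0.78)]   # 5, 1 and 5
response = sum(faces)                                 # 11

# An ensemble of 100 realizations; the seed is an arbitrary assumption.
rng = random.Random(2018)
responses = [simulate_sum(rng) for _ in range(100)]
histogram = Counter(responses)                        # response distribution
ensemble_mean = sum(responses) / len(responses)
print(sorted(histogram.items()), ensemble_mean)
```

Only the histogram and its summaries carry meaning; the individual faces of any one realization do not.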

Although many theoreticians and practitioners understand this point, it is not emphasized enough. Most software is aimed at processing one block model at a time. Resources are often presented as a single value instead of a distribution.

**Fig. 7.1** Simulation of the outcome on three dice (left). Histogram of the sum of the outcomes on three dice (right). One hundred realizations are shown

There are many examples of experimental mathematics in history. The scientists on the Manhattan Project are credited with the formulation and popularization of Monte Carlo simulation (MCS) or simply simulation. There are interesting historical references and internet resources. The framework of transferring input uncertainty to response variable uncertainty is often referred to as simulation. Adjectives such as Monte Carlo, stochastic or conditional are sometimes added. The outcomes of simulating are called simulations or realizations.

The pioneers of simulation suspected where we would take the method. The closing paragraph in Hammersley and Handscomb (1964) is telling: *Usually there are many nodes and possible paths, so many that a complete enumeration of the situation is impossible. This suggests a fruitful field for sampling and search procedures, but as yet little Monte Carlo work has been done here. There are challenging problems here for research into Monte Carlo techniques on multivariable problems*. They knew we had to sample a reasonable set of realizations from the practically infinite space of uncertainty. They knew we would be challenged by multiple dependent variables. They did not know that 50 years later many practitioners would still struggle to manage an ensemble of realizations.

This chapter is organized into five main sections supporting a case to use *all realizations all the time*. First, some principles of simulation are presented to set the context. Second, principles of decision making in the presence of uncertainty are discussed to establish that earth scientists are not alone. Third, some details of geostatistical simulation are presented to highlight important differences from simulation of independent variables. Fourth, some details of resource decision making are presented to highlight important differences from the general principles, including the information effect. Finally, some possible alternatives to using all realizations all the time are reviewed. A case is made to consider the correct approach, that is, consider all realizations all the time and base decisions on the appropriate expected value when required.

### **7.2 Simulation**

In the early days of simulation there was a particular concern related to the pseudorandom numbers applied in the simulation. A large part of early texts on simulation is devoted to the generation of pseudorandom numbers. This concern has largely been addressed and there is little practical concern with the pseudorandom number generators used in most software.

Another concern is in replacing the reality with a numerical model. Many early applications of Monte Carlo simulation were directed at solving integration and other equations where the transfer function is a very close representation of the physical situation. Examples of well represented physical systems are the study of radiation shielding and reactor criticality. The simulation tracks simulated particles through collisions where the particles are absorbed, scattered or split according to physical principles. There were few concerns about this simulation due to the close correspondence between the numerical setup and the physical reality. Increasingly, complex non-linear systems are modeled with empirical statistical models causing more concern.

It is impossible to model the details of the natural geological processes that led to the deposit under study. Empirical statistical models are required. Geostatistical models do not represent the original depositional and diagenetic processes. Although all models are wrong (Box and Draper 1987) they can be useful if assembled carefully with established workflows and appropriate checking.

The premise of simulation is to construct many realizations that are equally likely to be drawn. Realizations and responses more probable than others will be drawn more often. A fundamental principle of simulation is to consider many realizations. One hundred realizations may not be enough. The average of the one hundred realizations in Fig. 7.1 was 9.6, yet the true expected response for that particular process is 10.5. This suggests that the number of realizations should be quite large. Indeed, early practitioners of simulation considered that thousands of realizations were required unless some form of stratified or directed sampling could be implemented. Of course, the problems considered early on were small compared to the complexity of modern geological modeling, where tens of variables at tens of millions of locations are considered. In many cases, the professional and computational effort of generating more than hundreds of realizations would be better spent improving the model. This claim is supported by two observations: (1) the variability at multiple locations partially cancels out, and (2) there is too much uncertainty in the model to expend resources on thousands of realizations.
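For the three-dice process the space of uncertainty is small enough to enumerate, which gives a way to check the expected response of 10.5 and to see how slowly an ensemble mean settles. The sketch below assumes independent realizations and uses the standard result that the standard error of a mean of *n* independent draws is the standard deviation divided by √*n*; the function name `standard_error` is chosen for the example.

```python
import itertools
import math

# Enumerate the full space of uncertainty: 6**3 = 216 equally likely outcomes.
sums = [sum(faces) for faces in itertools.product(range(1, 7), repeat=3)]

expected = sum(sums) / len(sums)                                # 10.5
variance = sum((s - expected) ** 2 for s in sums) / len(sums)   # 8.75
std_dev = math.sqrt(variance)

def standard_error(n_realizations):
    """Standard deviation of the ensemble mean for independent realizations."""
    return std_dev / math.sqrt(n_realizations)

for n in (100, 1000, 10000):
    print(f"n = {n:>5}: standard error of the mean = {standard_error(n):.3f}")
```

Each tenfold reduction in the standard error costs a hundredfold increase in realizations, which is one reason stratified or directed sampling was attractive to early practitioners.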

Another fundamental principle of simulation is that all realizations are considered in downstream calculations. One application is to pass all realizations through the transfer function to construct a distribution of responses, for example, resource estimates. The realizations could be passed through a decision tree structure to help support a decision. Finally, the realizations could all be used in the optimization of decision variables. Incorrect or suboptimal decisions could be taken if too few realizations are considered.

The concept of ranking and choosing a few realizations is motivated by the large computational cost of running realizations through a complex full-physics transfer function. Processing the realizations through a simplified transfer function could rank the realizations and permit choosing a smaller number for the complex full-physics transfer function. Decision making and optimization applied with one or a few realizations, however, lead to overfitting to those realizations.

In some cases, the transfer function and decision variables are known. For example, calculating the recoverable reserves above a specified economic cutoff. In other cases, aspects of the decision must be optimized. For example, deciding the ultimate pit limits, choosing drill hole locations or deciding on the destination of mined material. If the transfer function and decision variables are known, then a probability distribution of each critical response variable is assembled from the realizations where the result of each realization is equally weighted. This distribution provides a direct understanding of uncertainty. There are many ways of summarizing the uncertainty. Considering the 0.1, 0.5 and 0.9 quantiles is common in petroleum applications, but considering the probability to be within 15% of expected is a reasonable measure of uncertainty.
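The summaries mentioned above are simple to compute once the responses from all realizations are in hand. The following sketch weights every realization equally and reports the mean, the 0.1, 0.5 and 0.9 quantiles, and the probability of being within 15% of the mean. The interpolation rule in `quantile` is one common convention among several, and the metal tonnages are hypothetical.

```python
import math

def quantile(sorted_values, p):
    """Empirical quantile by linear interpolation between order statistics."""
    position = p * (len(sorted_values) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return (1 - weight) * sorted_values[lower] + weight * sorted_values[upper]

def summarize(responses, tolerance=0.15):
    """Equal-weight summary of one response variable over all realizations."""
    ordered = sorted(responses)
    mean = sum(ordered) / len(ordered)
    within = sum(abs(x - mean) <= tolerance * abs(mean) for x in ordered)
    return {
        "mean": mean,
        "q10": quantile(ordered, 0.1),
        "q50": quantile(ordered, 0.5),
        "q90": quantile(ordered, 0.9),
        "prob_within_tolerance": within / len(ordered),
    }

# Hypothetical metal tonnages (kt) from eleven realizations.
metal_kt = [80, 86, 90, 95, 98, 100, 102, 105, 110, 114, 120]
summary = summarize(metal_kt)
print(summary)
```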

If aspects of the decision are not finalized, then decision making and optimization must be considered before calculating a distribution of the critical response variables.

### **7.3 Decision Making**

Decision making in the presence of uncertainty has long been studied (Bernoulli 1954; Kochenderfer 2015). The general framework of decision making could be summarized by (1) define clearly stated objectives within a value system, that is, a measure of utility (often profit), (2) enumerate the alternative decisions that could be taken, perhaps in a decision tree, (3) compute the expected utility for all alternatives, and (4) choose the alternative that maximizes expected utility. This framework becomes confounded with large one-time decisions or significant unknown unknowns that defy straightforward quantification. Grade control and mine planning decisions are made repeatedly within a clear economic framework.

Consider a recently loaded truck. The expected profit of the material if the truck goes to the mill would be computed by the average over all realizations, say \$6.75 per tonne. The expected profit if the material goes to the waste dump is the average of a similar calculation over all realizations, say −\$2.00 per tonne. With no other information, the truck should be sent to the mill. There are complicating factors including sequencing, stockpiling, limited milling capacity, but the principle stands. Decisions should be based on expected values as late as possible.
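The truck decision can be written out directly. In this sketch the per-realization profits are hypothetical numbers chosen so that the averages match the \$6.75 and −\$2.00 per tonne quoted above; the function names are assumptions, and complicating factors such as milling capacity are ignored.

```python
def expected_value(values):
    """Equal-weight average over all realizations."""
    return sum(values) / len(values)

def best_destination(profit_by_destination):
    """Pick the destination with the highest expected profit."""
    expected = {dest: expected_value(p)
                for dest, p in profit_by_destination.items()}
    return max(expected, key=expected.get), expected

# Hypothetical profit ($/tonne) of one truckload in each of five realizations.
profit = {
    "mill": [12.0, -3.5, 9.25, 14.0, 2.0],    # depends on the simulated grade
    "dump": [-2.0, -2.0, -2.0, -2.0, -2.0],   # cost only, whatever the grade
}
destination, expected = best_destination(profit)
print(destination, expected)
```

Note that one realization (−\$3.50 per tonne at the mill) would have favored the dump on its own; the decision rests on the average, not on any single realization.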

Decisions are based on the average over all realizations and not on one particular realization. The realizations are simply a means to represent uncertainty. One realization should not be chosen for decision making because that would mean ignoring other equally likely possibilities; the expectation is the only way to resolve the ambiguity of multiple realizations. The decision is also made as late as possible. Calling a block of material in a long term resource model ore may be convenient as an interim decision for planning purposes, but this decision would certainly be revisited with production sampling at the time of grade control.

There is another aspect to taking the expected value as late as possible. The expected value is calculated with the last numbers considered: utility or profit. The expected value should not be taken earlier. The correct decision would not always be found if the grades were averaged and the decision based on the utility computed from the expected grade. Many calculations are non-linear and the utility computed on the average of realizations is not the average (expected) utility computed on the realizations.
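A small numerical case shows why the order of averaging matters. The payoff function below is a hypothetical non-linear utility in which only grade above a cutoff pays; the grades, cutoff and margin are illustrative assumptions.

```python
def block_profit(grade, cutoff=1.0, margin=10.0):
    """Hypothetical non-linear payoff: only grade above the cutoff pays."""
    return margin * max(grade - cutoff, 0.0)

# Hypothetical simulated grades (g/t) for one block over five realizations.
grades = [0.25, 0.5, 0.75, 1.0, 2.5]

# Expected value taken too early: average the grades, then compute profit.
mean_grade = sum(grades) / len(grades)
profit_of_mean = block_profit(mean_grade)

# Expected value taken as late as possible: compute profit, then average.
mean_of_profit = sum(block_profit(g) for g in grades) / len(grades)

print(mean_grade, profit_of_mean, mean_of_profit)
```

With these numbers the average grade sits exactly at the cutoff, so the profit computed from the average grade is zero, while the expected profit over the realizations is positive because one realization is well above the cutoff.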

The distributions of payoff/utility for each possible decision are evaluated to determine the best decision. Some decisions may be completely dominated by others, that is, the best possible payoff of a dominated decision is less than the worst payoff of an alternative. All dominated decisions should be rejected. Some decisions are stochastically dominated by others (Levy 2016). That is, each quantile on the payoff distribution is less than the same quantile on an alternative. Decision makers should also reject all stochastically dominated decisions. The expected utility would be considered when multiple decisions remain to establish the optimal one.
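Both screening rules can be checked directly on the payoffs of the realizations. The sketch below assumes each decision has been evaluated on the same number of equally weighted realizations, so that sorted positions correspond to matching quantiles; the function names and payoffs are hypothetical.

```python
def completely_dominates(payoffs_a, payoffs_b):
    """True if the worst payoff of A exceeds the best payoff of B."""
    return min(payoffs_a) > max(payoffs_b)

def stochastically_dominates(payoffs_a, payoffs_b):
    """True if every quantile of A exceeds the same quantile of B.

    Assumes both lists hold equally weighted payoffs from the same number
    of realizations, so sorted positions correspond to matching quantiles.
    """
    if len(payoffs_a) != len(payoffs_b):
        raise ValueError("expected the same number of realizations")
    return all(a > b for a, b in zip(sorted(payoffs_a), sorted(payoffs_b)))

# Hypothetical payoffs ($M) of three decisions over five realizations.
plan_a = [4.0, 6.0, 7.5, 9.0, 12.0]
plan_b = [3.0, 5.0, 7.0, 8.5, 11.0]
plan_c = [0.5, 1.0, 2.0, 2.5, 3.5]

print(completely_dominates(plan_a, plan_c))      # A's worst beats C's best
print(stochastically_dominates(plan_a, plan_b))  # A beats B quantile by quantile
```

Here plan C would be rejected as completely dominated and plan B as stochastically dominated, leaving plan A.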

A challenge in many geological resource application problems is that the decision involves many different options. The precise sequence of extraction or the position of all production wells is combinatorial and all options cannot be considered. Optimization algorithms are implemented where the objective function is the appropriate expected value of profit or utility over all realizations. The distribution of uncertainty in utility is only known once optimization is complete.

The utility function quantifies our position on risk; however, it is not simple to establish the utility function in practice. One approach based on the idea of the efficient frontier could be considered (Francis and Dongcheol 2013; Hanoch and Levy 1969). Decisions are optimized based on maximum expected profit and minimum risk. The ones that are not dominated are retained as the efficient frontier. Judgement could be used to evaluate the differences between these decisions and to choose a path forward.
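Retaining the non-dominated decisions is a small filtering exercise. The sketch below treats each decision as a pair of expected profit and risk; measuring risk by the standard deviation of profit is an assumption made for the example, as are the plan names and values.

```python
def efficient_frontier(decisions):
    """Return names of decisions not dominated in (expected profit, risk).

    decisions -- dict mapping a name to (expected_profit, risk), where higher
                 profit is better and lower risk is better.
    """
    frontier = []
    for name, (profit, risk) in decisions.items():
        dominated = any(
            other_profit >= profit and other_risk <= risk
            and (other_profit > profit or other_risk < risk)
            for other, (other_profit, other_risk) in decisions.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# Hypothetical plans: (expected profit in $M, standard deviation in $M).
plans = {
    "A": (100.0, 30.0),
    "B": (90.0, 15.0),
    "C": (85.0, 20.0),   # dominated by B: less profit and more risk
    "D": (70.0, 10.0),
}
frontier = efficient_frontier(plans)
print(frontier)
```

Judgement is still needed to choose among the plans left on the frontier, since each trades expected profit against risk.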

### **7.4 Geostatistical Simulation**

The simulation of mineral deposits has evolved significantly over the last twenty years. The simulation is often hierarchical and multivariate with unequally sampled data and parameter uncertainty. A variety of techniques are used to create realizations that reproduce all available data and represent the variability that may influence the planning and decision making process (Caers 2011; Chilès and Delfiner 2012).

The scope of this chapter is not to present details of geostatistical simulation (Deutsch and Journel 1998; Goovaerts 1997). The main steps in managing the results will be reviewed. The transfer functions of greatest interest are resources and reserves within reasonably large volumes, uncertainty versus data spacing, uncertainty and variability in mine planning and sometimes optimization of blending and other engineering designs. Parameter uncertainty is important for the resources within large volumes. Data uncertainty is important with unequally sampled variables (common with geometallurgical and geomechanical variables). The steps in geostatistical simulation could be divided into five unit operations.


extent possible. The process of conditioning the realizations will update the prior uncertainty quantified in the second step. A schematic illustration of the realizations is shown below.

• **Process in Transfer Function** involves evaluating every realization for all calculations of interest. Local uncertainty can be computed for any block size. Resources can be computed for the entire deposit, within a mine plan or for different elevations. An ultimate pit could be computed for every realization. The economic performance of each realization could be evaluated. The uncertainty in each response variable is known non-parametrically through the distribution of responses. The expected response can be computed as an average of the responses.

The uncertainty is directly observed. It is common to assess sensitivity by indexing each realization by summary input parameters, for example, the gross rock volume, proportions of rock types, average grades, variogram ranges, and correlation coefficients. Then, the relationship between the input parameters and the response variables can be fit by a response surface and the sensitivity evaluated and presented by tornado charts. Further post processing is discussed below.
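The transfer-function step amounts to a loop over realizations. As a minimal sketch, the function below computes tonnage, grade and metal above a cutoff for each realization and then assembles the distribution of one response; the tiny four-block realizations, the block tonnage and the name `resources_above_cutoff` are hypothetical.

```python
def resources_above_cutoff(grades, cutoff, block_tonnes):
    """Transfer function: tonnage, mean grade and metal above a cutoff."""
    ore = [g for g in grades if g >= cutoff]
    tonnes = block_tonnes * len(ore)
    metal = block_tonnes * sum(ore)           # tonnes times grade units
    grade = metal / tonnes if tonnes else 0.0
    return {"tonnes": tonnes, "grade": grade, "metal": metal}

# Hypothetical ensemble: three realizations of block grades (g/t), four blocks.
realizations = [
    [0.5, 1.5, 2.0, 0.75],
    [1.0, 1.25, 0.5, 0.25],
    [0.25, 2.5, 1.5, 1.0],
]
results = [resources_above_cutoff(r, cutoff=1.0, block_tonnes=1000.0)
           for r in realizations]

# The distribution of a response is the list of per-realization values;
# the expected response is their equal-weight average.
tonnes = [res["tonnes"] for res in results]
expected_tonnes = sum(tonnes) / len(tonnes)
print(tonnes, expected_tonnes)
```

The same loop applies to any calculation that can be run on a single block model.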

### **7.5 Resource Decision Making**

All realizations should be used all the time. Anything that can be computed on one block model can be computed on one hundred, then the distribution of the response variable of interest can be assembled and summarized by expected value and other statistics. If a decision must be made, then the decision variable (economic value for ore, leach, dump…) can be computed on all realizations (Da Cruz 2000; Tversky and Kahneman 1992). The expected response determines the optimal decision.

When a mine plan is specified, then it is straightforward to evaluate all realizations through the plan and observe the uncertainty in key response variables due to the present state of incomplete knowledge. Sometimes the plan is not fixed and the realizations are to be used for planning and optimization. In principle this is not difficult. The objective function is the expected performance over all realizations. Some realizations may perform poorly with a particular plan and some better, but it is the expected value of the performance over all realizations that is the function to optimize (Pyrcz and Deutsch 2014). Considering the concept of the efficient frontier, the risk may be penalized to consider decisions that more reasonably suit the organization's position on risk.
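An objective function of this kind can be sketched as follows. Each candidate plan is evaluated on every realization and scored by its expected performance; the optional penalty of the form mean minus a multiple of the standard deviation is one simple way to express a position on risk and is an assumption here, not a form prescribed by the chapter. Plan names and NPV values are hypothetical.

```python
import statistics

def objective(performance, risk_aversion=0.0):
    """Expected performance over all realizations, optionally risk-penalized."""
    return (statistics.mean(performance)
            - risk_aversion * statistics.pstdev(performance))

def best_plan(performance_by_plan, risk_aversion=0.0):
    """Return the candidate plan with the highest objective value."""
    return max(
        performance_by_plan,
        key=lambda plan: objective(performance_by_plan[plan], risk_aversion),
    )

# Hypothetical NPV ($M) of two candidate plans evaluated on four realizations.
npv = {
    "aggressive": [40.0, 160.0, 20.0, 180.0],    # mean 100, wide spread
    "conservative": [85.0, 95.0, 90.0, 90.0],    # mean 90, narrow spread
}
risk_neutral_choice = best_plan(npv)
risk_averse_choice = best_plan(npv, risk_aversion=0.5)
print(risk_neutral_choice, risk_averse_choice)
```

A real mine planning problem has far too many candidate plans to enumerate, so an optimization algorithm would replace the `max` over a dictionary, but the objective it evaluates has the same structure.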

Fixing a production plan and running multiple realizations through the plan can be somewhat pessimistic since this assumes the plan cannot change in the future. In fact, more data becomes available as mining proceeds and the plan can adapt to the new knowledge.

Additional drilling is done to improve delineation ahead of production (Damsleth et al. 1992). Production sampling improves short-term mine planning and leads to a better understanding of the deposit. Uncertainty will resolve itself as production takes place and the mineral deposit is exposed for our greater understanding. The life-of-mine plan is updated on a regular basis (often yearly). A base case long term plan can be established with the current uncertainty and different options explored. The value of future information could be determined by simulating the additional data; this was the idea of the Simulated Learning Model (Cuba et al. 2014). There is flexibility for the plan to adapt to the future, but not change the past.

Flexibility is reduced as mining takes place. There is value in future flexibility (Stirling 2012). A slightly poorer decision, based on currently expected performance, with greater future flexibility may be better than a slightly better decision with less flexibility. The simultaneous optimization over multiple realizations should consider this flexibility.

Optimizing over all realizations simultaneously and considering all realizations through all engineering designs is correct, but difficult for some practitioners to accept (Bratvold et al. 2003; Guyaguler and Horne 2001; Wang et al. 2012). The computational challenges are exaggerated. The computers now are more than 100 times faster than they were about 10 years ago. Also, the ability to use multiple cores and GPUs means that we do not need to compromise much on the complexity of our calculations to consider all realizations all the time. The attraction of a single numerical geological model is undeniable. Most software does not permit easy visualization of multiple realizations. Although the ensemble of realizations should be managed together, the non-uniqueness of multiple realizations is disturbing. The simplest alternative is to use a kriged model for planning and all reporting purposes; the simulated realizations are reserved for uncertainty statements and an understanding of variability.

### **7.6 Alternatives to All Realizations**

Some simple summary models are useful. The probability to meet an economic threshold is useful; high probability is good. The local probability to exceed, say, the global 0.75 quantile is also useful to identify the areas that are surely high: if this probability is high (say over 0.9), then the area is surely high. The local probability to be below, say, the global 0.25 quantile is useful to identify areas that are surely low: if this probability is high (say over 0.9), then the area is surely low. The local variance or the probability to be within 15% of expected are also useful summary measures.
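The "surely high" and "surely low" summaries follow from counting realizations at each location. In the sketch below the global 0.25 and 0.75 quantiles are supplied by the caller, and the grades, thresholds and the label "uncertain" for everything else are assumptions for illustration.

```python
def classify_location(local_values, q25, q75, confidence=0.9):
    """Label one location from its simulated values over all realizations.

    q25, q75 -- global 0.25 and 0.75 quantiles, supplied by the caller
    """
    n = len(local_values)
    p_high = sum(v > q75 for v in local_values) / n   # probability to exceed
    p_low = sum(v < q25 for v in local_values) / n    # probability to be below
    if p_high > confidence:
        return "surely high"
    if p_low > confidence:
        return "surely low"
    return "uncertain"

# Hypothetical simulated grades at three locations over ten realizations,
# with assumed global quartiles of 0.5 and 1.5.
location_a = [1.8, 2.1, 1.9, 2.4, 1.7, 2.0, 2.2, 1.6, 2.3, 1.9]
location_b = [0.2, 0.3, 0.1, 0.4, 0.3, 0.2, 0.45, 0.1, 0.3, 0.2]
location_c = [0.4, 1.8, 0.9, 1.2, 2.0, 0.3, 1.1, 1.6, 0.7, 1.0]

for values in (location_a, location_b, location_c):
    print(classify_location(values, q25=0.5, q75=1.5))
```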

Another approach is to collapse uncertainty into a few summary measures and base planning on them. For example, multiple realizations could be summarized by proportions of ore and waste over multiple realizations within reasonable planning volumes. One could even consider that each block has a proportion of ore and a proportion of waste. The block will be found to be all ore or all waste in the future; the proportions are simply used to collapse uncertainty.

Summarizing multiple realizations is useful. The summaries make use of the multiple realizations. Plans optimized on a summary are never as good as plans optimized over all realizations simultaneously (primarily due to the complexity and non-linearity of most planning operations); however, it may be the only practical approach offered by the available software.

The realizations are equally probable: there is no right one, there is no P50 one, and we have no idea whether one is closer to the truth than the others. A dangerous practice emerged in the early days of simulation: run the realizations through a quick-to-calculate transfer function, rank the realizations by the quick-to-calculate response, then consider only selected realizations (say, the P10, P50 and P90) in the "real" more complicated transfer function.

In general, individual realizations should never be singled out for calculations. There is much about a single realization that depends on the random number generator and that is not real. Any one realization could be misleading. There are some specific calculations that could be done with one realization because the variability at specific locations (that we do not trust) averages out over multiple realizations. Blending studies and drilling spacing studies are two examples. It may be enough to run one or a few realizations through a simulation of the homogenization steps to understand the probability of plant upsets and undesirable circumstances. The variability at multiple locations reflects the overall variability and the specific location/time is not critical.

In almost all cases, the simplest and most robust approach is to consider all realizations and take expected values at the end to report a single result.

### **7.7 Concluding Remarks**

Monte Carlo Simulation is a well-established experimental mathematical approach to transfer uncertainty in input geological and engineering variables through to response variables. The primary aim of this chapter was to point out the danger of using one realization instead of an ensemble of realizations. One realization may fall near the middle based on a quick-to-calculate response variable and yet it could be unusually high in some places and low in others. Planning on one realization could be misleading. The nonlinearity and complexity of many real response variables requires the ensemble of realizations to be considered for proper planning and uncertainty assessment. All realizations all the time – anything less will not give correct results.

**Acknowledgements** The author thanks the sponsors of the Centre for Computational Geostatistics (CCG) for supporting this work.

### **References**



### **Chapter 8 Binary Coefficients Redux**

**Michael E. Hohn**

**Abstract** Paleoecologists and paleogeographers still make use of binary coefficients in multivariate analysis decades after these coefficients were introduced to the geosciences. Among the main groups (similarity, matching and association), selecting a particular coefficient remains a confusing and sometimes empirical process. Coefficients within groups tend to correlate highly when applied to datasets. With increasing interest in a probabilistic approach to grouping taxa or faunal lists, the Raup-Crick measure of association is closely related in purpose and empirically to coefficients of association and works well in cluster analysis and ordination. A reasonable strategy is to compare dendrograms and ordinations calculated with several coefficients, care being taken to select coefficients with different performance characteristics. Above all, the practitioner should understand the purpose of each coefficient.

### **8.1 Introduction**

The founding of the International Association for Mathematical Geology resulted in part from the increased use of quantitative methods in the geosciences, which occurred simultaneously with developments in computer hardware and availability. This is no less true for paleontology and paleoecology, fields of endeavor characterized by observing, describing, and synthesizing. With the 1960s and 70s came the development of large databases of fossil occurrences from which researchers could formally infer periods of rapid evolution and episodes of major extinction. Patterns of extinction through time could be simulated with random number generators. Paleoecologists studied whether fossil communities persisted through time and the structure of these communities.

This was a period of synthesis. The *Treatise on Invertebrate Paleontology* (Moore et al. 1953–2015) provided a need for stable taxonomies, a confidence that such a classification could be created, and the motivation for explaining trends in evolution.

M. E. Hohn (✉)
West Virginia Geological and Economic Survey, Morgantown, USA
e-mail: hohn@geosrv.wvnet.edu; mehohn@frontier.com

© The Author(s) 2018. B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_8

Multivariate statistical methods developed in other fields became tools for reducing large datasets to manageable size while providing some degree of objectivity in the analysis. Cluster analysis, multidimensional scaling, factor analysis, and related eigenvector methods became familiar tools to the quantitatively inclined geologist.

All methods required a measure of similarity, correlation, distance, dissimilarity, or association expressed as a coefficient. Eigenvector-based methods such as factor analysis and principal components analysis (PCA) by definition utilize covariance or correlation among variables (R-mode) and implicitly Euclidean distance in displays of sample coordinates (Q-mode). In contrast, cluster analysis, multidimensional scaling, and principal coordinates analysis (Gower 1971) allow use of a wide range of coefficients, but also require the user to decide which coefficient to use.

Multivariate statistical methods introduced to paleontologists in the 1960s and 70s continue to be used for studying the distribution of fossils in space and through time. With the existence of large databases of fossil occurrences, binary coefficients remain important for comparing collections made over many decades by many individuals.

It still falls to the practitioner to select one coefficient out of the many proposed over the course of more than a century. Given no clear criterion, some elect to use several coefficients to see whether the choice affects the results.

There is certainly a rich and extensive literature related to the purpose and performance of coefficients, both within the paleoecology literature and in the scientific and engineering literature at large as new applications are found for these coefficients. Surveys of existing measures range in approach. Some consider the conceptual basis for each coefficient; others ask how well coefficients satisfy their purpose, how they behave relative to each other or relative to a goal set by the author, whether they achieve a clear criterion such as satisfying metric properties (Gower and Legendre 1986), and which ones give similar or correlated results. Above all, surveys ask whether coefficients include mutual absences. That last criterion might appear to be a small detail compared to the other comparisons, but it introduces a fundamental question about the role that chance plays in the distribution of fossils in the collection under study.

This chapter reviews the criteria and arguments used in the past four decades in comparing binary coefficients. In this chapter, I will first group coefficients into three families based on shared formulations and behavior; discuss how such factors as abundance of taxa or poor sampling can affect coefficients; consider metric properties of coefficients; look at probability-based coefficients; apply several coefficients to paleoecological data; and sum up where we are today compared to four decades ago.

I will introduce coefficients as I go along, using what has become standard notation for binary coefficients. Assume we have sampled taxa from *N* locations. Then for a given pair of taxa:

*a* = number of co-occurrences

*b* = number of locations where taxon 1 occurs and taxon 2 does not

*c* = number of locations where taxon 2 occurs and taxon 1 does not

*d* = number of locations where neither taxon is observed

*N* = *a* + *b* + *c* + *d*.
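The four counts can be tallied from two presence/absence vectors. The sketch below assumes each taxon is scored as 1 (present) or 0 (absent) at the same *N* locations; the function name `contingency_counts` and the occurrence data are hypothetical.

```python
def contingency_counts(taxon1, taxon2):
    """Return (a, b, c, d) for two presence/absence vectors of equal length."""
    if len(taxon1) != len(taxon2):
        raise ValueError("both taxa must be scored at the same N locations")
    pairs = list(zip(taxon1, taxon2))
    a = sum(1 for x, y in pairs if x and y)            # co-occurrences
    b = sum(1 for x, y in pairs if x and not y)        # taxon 1 only
    c = sum(1 for x, y in pairs if y and not x)        # taxon 2 only
    d = sum(1 for x, y in pairs if not x and not y)    # mutual absences
    return a, b, c, d

# Hypothetical occurrences of two taxa at N = 8 locations (1 = present).
taxon1 = [1, 1, 1, 0, 0, 1, 0, 0]
taxon2 = [1, 0, 1, 1, 0, 0, 0, 0]
a, b, c, d = contingency_counts(taxon1, taxon2)
print(a, b, c, d, a + b + c + d)
```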

### **8.2 Empirical Comparisons and a Taxonomy**

As the use of cluster analysis has expanded beyond the biological and geological sciences, papers have appeared in the literature that try to get a handle on the multitude of coefficients by comparing the way they behave relative to each other or to a criterion based on an application. Although outside the field of paleoecology, these publications often cast a wide net in gathering coefficients and present surveys purely empirical in nature. In the general area of pattern recognition, Choi et al. (2009) compute correlation coefficients among 76 binary coefficients for several types of random or structured datasets, observing that pairwise correlations between coefficients can be very high, depending in part on the pattern and number of presences.

In a companion paper, Choi et al. (2010) created random binary datasets, computed values for each coefficient, and averaged the trials to create a dendrogram of the 76 coefficients. They identify eleven clusters, some with only a single coefficient, several with two to six members, and two large clusters with over twenty members. The second largest includes such frequently used coefficients as the Jaccard, Otsuka, Dice, and the Bray and Curtis, where:

$$\begin{aligned} C_{\text{Jaccard}} &= a/(a+b+c) \\ C_{\text{Otsuka}} &= a/\sqrt{(a+b)(a+c)} \\ C_{\text{Dice}} &= 2a/(2a+b+c) \\ C_{\text{Bray and Curtis}} &= (b+c)/(2a+b+c) \end{aligned}$$

That these coefficients are correlated highly in an absolute sense should come as no surprise given the algebraic relationships between several. For example:

$$C_{\text{Dice}} = 1 - C_{\text{Bray and Curtis}}$$

converting a dissimilarity coefficient (Bray and Curtis) into a coefficient expressing similarity. The difference between the Dice and Jaccard coefficients is in weighting the mutual occurrences. Remember that many coefficients were defined as measures of similarity, dissimilarity, or association rather than as input to clustering and ordination routines. Their creators had specific reasons for selecting and weighting the terms (*a*, *b*, *c*, or *d*) in the context of a study and according to some research goal. In many cases they might have been fully aware that their coefficient was similar to one in the literature, but their coefficient measured what they wanted to measure.
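The four coefficients in this cluster can be written directly from their definitions. The sketch below uses hypothetical counts (*a* = 6, *b* = 3, *c* = 1) and hypothetical function names. It simply checks that the Dice and the Bray and Curtis coefficients sum to 1.

```python
import math


def jaccard(a: int, b: int, c: int) -> float:
    return a / (a + b + c)


def otsuka(a: int, b: int, c: int) -> float:
    return a / math.sqrt((a + b) * (a + c))


def dice(a: int, b: int, c: int) -> float:
    # Mutual occurrences receive double weight relative to the Jaccard.
    return 2 * a / (2 * a + b + c)


def bray_curtis(a: int, b: int, c: int) -> float:
    # Dissimilarity form: mismatches in the numerator.
    return (b + c) / (2 * a + b + c)


a, b, c = 6, 3, 1  # hypothetical counts for one pair of taxa
for name, fn in [("Jaccard", jaccard), ("Otsuka", otsuka),
                 ("Dice", dice), ("Bray and Curtis", bray_curtis)]:
    print(f"{name:16s} {fn(a, b, c):.3f}")

# Dice similarity and Bray and Curtis dissimilarity are complements.
print(abs(dice(a, b, c) + bray_curtis(a, b, c) - 1.0) < 1e-12)  # True
```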

The largest group of coefficients includes a subset containing, among others, the Simple Matching (called the Sokal and Michener coefficient in their paper), Rogers and Tanimoto, and Hamann coefficients, where:

$$\begin{aligned} C_{\text{SimpleMatching}} &= (a+d)/(a+b+c+d) \\ C_{\text{Rogers and Tanimoto}} &= (a+d)/[a+2(b+c)+d] \\ C_{\text{Hamann}} &= [(a+d)-(b+c)]/(a+b+c+d) \end{aligned}$$

Notice that these coefficients include the term *d* for mutual absence. The Rogers and Tanimoto coefficient is the same as the Simple Matching except that mismatches receive double weight in the denominator, and the Hamann can be expressed in terms of the Simple Matching by substituting *N* − (*a* + *d*) for (*b* + *c*).
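The algebraic links among the matching coefficients are easy to verify numerically. In the sketch below, which uses hypothetical counts and function names, the substitution described above gives Hamann = 2 × Simple Matching − 1. The same substitution in the Rogers and Tanimoto denominator gives Rogers and Tanimoto = Simple Matching / (2 − Simple Matching).

```python
def simple_matching(a: int, b: int, c: int, d: int) -> float:
    return (a + d) / (a + b + c + d)


def rogers_tanimoto(a: int, b: int, c: int, d: int) -> float:
    return (a + d) / (a + 2 * (b + c) + d)


def hamann(a: int, b: int, c: int, d: int) -> float:
    return ((a + d) - (b + c)) / (a + b + c + d)


counts = (6, 3, 1, 10)  # hypothetical a, b, c, d with N = 20
sm = simple_matching(*counts)
rt = rogers_tanimoto(*counts)
hm = hamann(*counts)
print(f"SM = {sm:.3f}, RT = {rt:.3f}, Hamann = {hm:.3f}")

# Both relationships follow from b + c = N - (a + d).
print(abs(hm - (2 * sm - 1)) < 1e-12)   # True
print(abs(rt - sm / (2 - sm)) < 1e-12)  # True
```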

A third, small group includes three similar coefficients, two derived from the familiar χ<sup>2</sup> statistic, including the Phi coefficient:

$$C_{\text{Phi}} = (ad - bc) / \sqrt{(a+b)(a+c)(b+d)(c+d)} = \sqrt{\chi^2/N}$$

These coefficients express correlation; in fact *C*<sub>Phi</sub> is the correlation coefficient for binary data and can be calculated in the same way as a correlation coefficient for non-binary data.
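The equivalence between Phi and the ordinary product-moment correlation can be checked on a small example. The sketch below assumes two hypothetical occurrence vectors and computes Phi from the counts and the correlation from the raw 0/1 values. The function names are illustrative only.

```python
import math
from typing import Sequence


def phi(a: int, b: int, c: int, d: int) -> float:
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation computed from centered values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((i - mean_x) * (j - mean_y) for i, j in zip(x, y))
    sxx = sum((i - mean_x) ** 2 for i in x)
    syy = sum((j - mean_y) ** 2 for j in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical occurrences of two taxa at eight locations.
taxon_1 = [1, 1, 1, 0, 0, 1, 0, 0]
taxon_2 = [1, 1, 0, 1, 0, 0, 0, 0]
a = sum(1 for i, j in zip(taxon_1, taxon_2) if i and j)
b = sum(1 for i, j in zip(taxon_1, taxon_2) if i and not j)
c = sum(1 for i, j in zip(taxon_1, taxon_2) if not i and j)
d = sum(1 for i, j in zip(taxon_1, taxon_2) if not i and not j)

print(f"Phi     = {phi(a, b, c, d):.4f}")
print(f"Pearson = {pearson(taxon_1, taxon_2):.4f}")
```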

Related to these coefficients is a large cluster characterized by a numerator containing the term (*ad* – *bc*), *ad*, or (*a* + *d*). Examples are the Yule's Q (or simply Yule), Ochiai 2, and Gower:

$$\begin{aligned} C_{\text{Yule}} &= (ad - bc) / (ad + bc) \\ C_{\text{Ochiai2}} &= ad / \sqrt{(a + b)(a + c)(b + d)(c + d)} \\ C_{\text{Gower}} &= (a + d) / \sqrt{(a + b)(a + c)(b + d)(c + d)} \end{aligned}$$

Similar to the matching coefficients, these and the Phi express agreement between two entities based on mutual presence and absence, but adjusted for relative abundance of the entities, analogous to the centering and scaling in calculating the correlation coefficient and Phi.

These four groups account for most of the binary coefficients one is likely to encounter in the geosciences, including ones discussed below. If we lump the last two clusters, a simple taxonomy of coefficients has as groups:

1. **Similarity coefficients**, computed from the number of mutual occurrences, scaled by the total number of features occurring in one or the other entity. In paleoecology, entities can be taxa and features can be locations. Some coefficients express dissimilarity by calculating *b* + *c* rather than *a*, but such a coefficient can be converted to similarity by subtracting it from 1.


This taxonomy agrees with that in Hohn (1976) except I am using a more rigorous definition of a distance coefficient by not including the City Block metric in that group. However, √(*b* + *c*) is a distance.

Even when two coefficients are not mathematically equivalent, they can be related monotonically (Gower and Legendre 1986) and give virtually the same results when used in cluster analysis or nonmetric multidimensional scaling. In lieu of selecting a single best coefficient, many researchers perform multiple cluster analyses or ordinations to observe whether results change with choice of coefficient. In such an exercise, one wants to make sure to select coefficients with different properties or behaviors.
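One familiar instance of a monotonic relationship is that between the Jaccard and Dice coefficients, where Dice = 2 × Jaccard / (1 + Jaccard). The sketch below uses hypothetical counts for four pairs of taxa. It shows that the two coefficients rank the pairs identically, which is why rank-based methods give the same result with either one.

```python
# Hypothetical (a, b, c) counts for four pairs of taxa.
pairs = {
    "A-B": (6, 3, 1),
    "A-C": (2, 5, 4),
    "B-C": (4, 1, 1),
    "A-D": (1, 8, 2),
}

jaccard = {k: a / (a + b + c) for k, (a, b, c) in pairs.items()}
dice = {k: 2 * a / (2 * a + b + c) for k, (a, b, c) in pairs.items()}

# The two coefficients differ in value but order the pairs identically.
rank_jaccard = sorted(pairs, key=jaccard.get)
rank_dice = sorted(pairs, key=dice.get)
print(rank_jaccard)
print(rank_dice)

# Monotone transformation linking the two coefficients.
for k in pairs:
    j = jaccard[k]
    assert abs(dice[k] - 2 * j / (1 + j)) < 1e-12
```

Because the ranks agree, a comparison of coefficients in the sense recommended above should pair the Jaccard or Dice with a coefficient from a different family rather than with each other.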

### **8.3 Effects of Rare and Endemic Taxa**

In an empirical study of eight similarity coefficients, Jackson et al. (1989) used a dataset comprised of 25 species of fish observed in 52 lakes in south-central Ontario, Canada. One feature that distinguishes this dataset is that species range from very common to rare, from as many as 47 lakes to as few as 2. The eight coefficients are the Jaccard, Dice, Simple Matching, Rogers and Tanimoto, Otsuka ("Ochiai" in their paper), Phi, Yule, and the Russell and Rao:

$$C_{\text{Russell and Rao}} = a/(a+b+c+d)$$

Unsurprisingly, the Jaccard and Dice gave nearly identical results in a cluster analysis. The same held for the Simple Matching and Rogers and Tanimoto coefficients. Results for the Otsuka were close to the Jaccard and Dice. The dendrogram for Russell and Rao coefficient shows almost no clusters although the general ordering of the species was very similar to the Jaccard, Dice, Simple Matching, Rogers and Tanimoto, and Otsuka.

They also performed principal coordinates analysis for each of the eight coefficients. They observed that the order of species on the first axis correlated highly with the number of lakes in which each occurred for all but the Otsuka, Phi, and Yule coefficients. Some of the correlations are very high, over 0.99 for the Simple Matching and Rogers and Tanimoto. In other words, the first axis corresponded to the frequency of each species, a general "size" factor in their words. Species abundance correlated poorly with the two major principal coordinates axes for the two coefficients of association, the Phi and Yule. The Otsuka showed some effect of species frequency. Nonmetric Multidimensional Scaling gave similar results. The order of species in dendrograms from cluster analysis also showed this frequency effect for the similarity coefficients; not so for the two coefficients of association.

They concluded that similarity coefficients, which they term co-occurrence coefficients, are heavily influenced by frequency, whereas the implicit centering that takes place in calculating the Phi and Yule mitigates this effect. They also conclude that the Otsuka formulation does a centering that partially eliminates the frequency effect.
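The frequency effect can be illustrated without any data by considering two taxa that occur independently of each other. The sketch below is not a reanalysis of the Jackson et al. dataset. It assumes both taxa occupy the same fraction *p* of 52 hypothetical locations and uses the expected counts, which need not be integers. The Jaccard coefficient rises with *p* even though there is no association, whereas Phi stays at zero.

```python
import math
from typing import Tuple


def expected_counts(p1: float, p2: float, n: int) -> Tuple[float, float, float, float]:
    """Expected a, b, c, d for two taxa distributed independently."""
    a = n * p1 * p2
    b = n * p1 * (1 - p2)
    c = n * (1 - p1) * p2
    d = n * (1 - p1) * (1 - p2)
    return a, b, c, d


def jaccard(a: float, b: float, c: float, d: float) -> float:
    return a / (a + b + c)


def phi(a: float, b: float, c: float, d: float) -> float:
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))


N = 52  # hypothetical number of locations
results = []
for p in (0.1, 0.5, 0.9):
    counts = expected_counts(p, p, N)
    results.append((p, jaccard(*counts), phi(*counts)))
    print(f"p = {p:.1f}  Jaccard = {results[-1][1]:.3f}  Phi = {results[-1][2]:.3f}")
```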

### **8.4 Adjusting for Poor Sampling**

In the context of Q-mode analysis—that is, the comparison of samples rather than the R-mode comparison of taxa—Alroy (2015a, b) looks at the effect of uneven sampling and consequent uneven sample size on four binary coefficients: the Forbes, a modified Forbes coefficient, Simpson's coefficient, and the Dice, where the Forbes coefficient is:

$$C_{\text{Forbes}} = aN/[(a+b)(a+c)]$$

and the Simpson:

$$C_{\text{Simpson}} = a / \min[(a+b), (a+c)]$$

Alroy modifies the Forbes coefficient in two ways. First, he argues against including mutual absences and therefore substitutes *n* for *N* where *n* = *a* + *b* + *c*. Secondly, he adds constants to correct for an upward bias in the coefficient:

$$C_{\text{ForbesMod}} = a(n + \sqrt{n})/[(a+b)(a+c) + a\sqrt{n} + \tfrac{1}{2}bc]$$

Although there is no theoretical basis for these constants, the resulting coefficient does accomplish what he sets out to do. In several analyses of real and simulated datasets, he shows that both versions of the Forbes coefficient and the Simpson far outperform the Dice. This is consistent with results obtained by Jackson et al. (1989) in which coefficients such as the Dice are influenced very much by species frequency in R-mode analysis.
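The behavior of these coefficients under uneven sample size can be sketched with a small nested example. The code below is illustrative only. The modified Forbes function assumes the form a(n + √n)/[(a+b)(a+c) + a√n + ½bc] with n = a + b + c, which should be checked against Alroy (2015a) before use. The counts describe a hypothetical small faunal list of 5 taxa wholly contained in a larger list of 25.

```python
import math


def forbes(a: int, b: int, c: int, d: int) -> float:
    n_total = a + b + c + d
    return a * n_total / ((a + b) * (a + c))


def forbes_modified(a: int, b: int, c: int) -> float:
    # Assumed form of the corrected coefficient; mutual absences are excluded.
    n = a + b + c
    root_n = math.sqrt(n)
    return a * (n + root_n) / ((a + b) * (a + c) + a * root_n + 0.5 * b * c)


def simpson(a: int, b: int, c: int) -> float:
    return a / min(a + b, a + c)


def dice(a: int, b: int, c: int) -> float:
    return 2 * a / (2 * a + b + c)


# Hypothetical nested samples: every taxon in the small list is in the large one.
a, b, c = 5, 0, 20
print(f"Simpson         = {simpson(a, b, c):.3f}")
print(f"Modified Forbes = {forbes_modified(a, b, c):.3f}")
print(f"Dice            = {dice(a, b, c):.3f}")
```

In this constructed case the Simpson and modified Forbes coefficients both equal 1, while the Dice is pulled down to one third by the difference in list length alone.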

Alroy clearly favors the modified Forbes over Simpson's coefficient. However, results for both in cluster analysis and principal components analysis are very similar and would probably lead to the same conclusions based on the relative positions of samples on dendrograms and principal coordinates axes. This is no surprise given that the Simpson was formulated to account for uneven sample size.

Although Alroy dismisses probabilistic coefficients and coefficients of association in part for including mutual absences, it would be interesting to compare them with the two Forbes and the Simpson coefficients with his datasets.

These papers address the problem of working with datasets of mixed, perhaps unknown sampling regimen. The difference between otherwise identical faunal lists might be the time or skill in observation. This is perhaps less of an issue when a dataset comes from a single sampling campaign, but in these days of large databases compiled from many studies this is a problem to be taken seriously. Alroy's results argue for careful selection of a coefficient and suggest that analysis with multiple coefficients might be beneficial if sampling issues are suspected.

Alroy points out that the Forbes coefficient has fallen out of use over time. However, since the publication of his papers, Halliday et al. (2017) used his modified form of the Forbes coefficient in cluster analysis of Late Cretaceous vertebrates across India. Although the papers by Alroy and by Halliday et al. describe ordination and cluster analysis of localities, the same problem of uneven sampling exists in analysis of taxa and their arguments and findings should have application in R-mode analysis as well.

### **8.5 Metric? Euclidean?**

Some attention has been paid in the past to the question of whether a dissimilarity coefficient is metric, Euclidean, or neither. A coefficient is metric if for every triplet (*i*, *j*, *k*) the following inequality holds:

$$D_{ij} + D_{jk} \ge D_{ik}$$

On the face of it, methods such as principal coordinates analysis require a dissimilarity that is Euclidean. In actuality, Gower and Legendre (1986) and others have observed that departures from strict Euclidean geometry for many coefficients are generally small. Adding a constant to a distance can sometimes take care of this problem. It sometimes works to use the square root of the distance. They include a table showing that many familiar similarity coefficients, *C*, are metric but not Euclidean if converted to a dissimilarity coefficient 1 – *C*, and even more are metric and many Euclidean if √(1 – *C*) is calculated. They consider most of the binary coefficients listed above with the notable exception of Yule's coefficient.
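The triangle inequality can be tested by brute force over all triplets. The sketch below is a constructed example, not one taken from Gower and Legendre. Three hypothetical taxa are represented by the sets of locations they occupy. On these three taxa, 1 – Dice violates the inequality, whereas its square root and 1 – Jaccard both satisfy it.

```python
import math
from itertools import permutations
from typing import Callable, Sequence, Set


def dice_dissimilarity(x: Set[int], y: Set[int]) -> float:
    return 1 - 2 * len(x & y) / (len(x) + len(y))


def jaccard_dissimilarity(x: Set[int], y: Set[int]) -> float:
    return 1 - len(x & y) / len(x | y)


def satisfies_triangle_inequality(items: Sequence[Set[int]],
                                  dist: Callable[[Set[int], Set[int]], float],
                                  tol: float = 1e-12) -> bool:
    """Check D_ij + D_jk >= D_ik for every ordered triplet of items."""
    return all(dist(i, j) + dist(j, k) >= dist(i, k) - tol
               for i, j, k in permutations(items, 3))


# Hypothetical taxa, each given as the set of locations where it occurs.
taxa = [{1}, {2}, {1, 2}]

print(satisfies_triangle_inequality(taxa, dice_dissimilarity))     # False
print(satisfies_triangle_inequality(
    taxa, lambda x, y: math.sqrt(dice_dissimilarity(x, y))))       # True
print(satisfies_triangle_inequality(taxa, jaccard_dissimilarity))  # True
```

A passing check on one dataset does not prove that a coefficient is metric in general. It only shows that no violation occurs among the items tested.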

Zhang and Srihari (2003) discuss the properties and behavior of similarity, matching, and coefficients of association, including metric properties, equivalent measures of similarity and dissimilarity, discriminatory capability of the coefficients, and the effect of weighting mutual absences. Like many authors they prefer metric coefficients. A large proportion of papers in the geosciences utilize nonmetric multidimensional scaling or cluster analysis with no requirement for the coefficient to be Euclidean or even metric. Reasons for selecting a method for multivariate analysis no doubt vary among authors, ranging from convenience or familiarity, available methods in a statistical package, to wanting to avoid the stronger requirements of eigenvector-based methods. However, as Gower and Legendre (1986) point out, proportionally small deviations from the geometric assumptions of an eigenvector method affect the results very little.

### **8.6 From Expected Values to Null Association**

We can look at the diversity of coefficients along a spectrum from similarity coefficients at one end to coefficients of association at the other. In comparing faunal lists, for instance, similarity coefficients count the number of species in common between two locations normalized by the number of species found in one or the other. In other words, they can be said to measure overlap in faunal lists in a Q-mode analysis or geographic overlap of two taxa in an R-mode analysis.

Midway along the spectrum are coefficients that compare an observed value with the expected value. As described by Alroy (2015a), the chance of a species appearing in the faunal list at one site is (*a* + *b*)/*N*, the chance at a second site (*a* + *c*)/*N*, and the chance of being found in both is [(*a* + *b*)(*a* + *c*)]/*N*<sup>2</sup>. Therefore, the number of species expected to be found in both is [(*a* + *b*)(*a* + *c*)]/*N* and the ratio of the observed number *a* to the expected number is *aN*/[(*a* + *b*)(*a* + *c*)], which is the Forbes coefficient.
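A worked example may help fix the argument. Assume, purely for illustration, a pool of *N* = 40 species of which 10 occur at the first site, 8 at the second, and *a* = 6 at both, so that *b* = 4 and *c* = 2:

$$\begin{aligned} \frac{a+b}{N} &= \frac{10}{40} = 0.25, \qquad \frac{a+c}{N} = \frac{8}{40} = 0.20 \\ \frac{(a+b)(a+c)}{N} &= \frac{10 \times 8}{40} = 2 \\ \frac{aN}{(a+b)(a+c)} &= \frac{6 \times 40}{10 \times 8} = 3 \end{aligned}$$

In this hypothetical case the two sites share three times as many species as expected under independence.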

Hohn (1976), Raup and Crick (1979) and others have argued that cluster analysis or ordination should consider whether observed overlaps in faunal lists in paleogeographic studies or occurrence of taxa in paleoecological studies represent anything more than a random distribution of taxa through space. Of course there is no denying that species respond to environmental and geographic variables, but the question is how to separate similar distributions that arose by chance from those that represent nonrandom processes.

Within a biological context, Hubálek (1982) surveyed forty-three coefficients, eliminated about half based on algebraic equivalence, mere difference in scale, or failure to meet several criteria, and compared the rest through product-moment correlation and cluster analyses. Although one of these criteria is monotonicity with √(χ<sup>2</sup>), Hubálek stops short of recommending a coefficient such as Phi that is related directly to a test of significance in association.

In contrast, I proposed (Hohn 1976) that we should pay more attention to the Phi coefficient. Raup and Crick (1979) derive the formula for exact probabilities equal to Fisher's Exact Test for independence in 2 by 2 tables, an alternative to the usual χ<sup>2</sup> test. They modified what is essentially a Phi coefficient in comparing faunal lists by using a Monte Carlo method to weight taxa according to abundance. The result is a coefficient that like Phi and similar coefficients includes mutual absences, but represents a further refinement by taking relative abundance of taxa into account.
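The exact probability for a 2 by 2 table follows the hypergeometric distribution and can be computed directly. The sketch below shows only this exact calculation (the one-sided tail of Fisher's Exact Test). It is not the Raup-Crick Monte Carlo procedure, which additionally weights taxa by abundance. The function names and counts are hypothetical.

```python
from math import comb


def prob_cooccurrences(k: int, n1: int, n2: int, n_total: int) -> float:
    """P(exactly k co-occurrences) when taxon 1 occupies n1 and taxon 2
    occupies n2 of n_total locations, placed independently at random."""
    return comb(n1, k) * comb(n_total - n1, n2 - k) / comb(n_total, n2)


def upper_tail(a: int, b: int, c: int, d: int) -> float:
    """P(at least a co-occurrences) given the observed marginal totals."""
    n1, n2, n_total = a + b, a + c, a + b + c + d
    return sum(prob_cooccurrences(k, n1, n2, n_total)
               for k in range(a, min(n1, n2) + 1))


# Hypothetical counts: a = 6, b = 3, c = 1, d = 10.
p_value = upper_tail(6, 3, 1, 10)
print(f"P(a >= 6) = {p_value:.4f}")  # P(a >= 6) = 0.0124
```

A small tail probability indicates that the observed number of co-occurrences would rarely arise from a random distribution of the two taxa among locations.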

Winrow and Sutton (2014) calculated five coefficients—Raup-Crick, Simpson, Jaccard, Dice, and Otsuka (Ochiai)—in a paleogeographic study of lingulate brachiopods during the Early Paleozoic. Unable to determine a single best coefficient, they opted to calculate and compare several. Unsurprisingly, the Jaccard, Dice, and Otsuka gave very similar results. Raup-Crick and Simpson coefficients showed different patterns among pairs of faunal lists representing different paleocontinents. They do not explain why coefficients would give different results other than attributing several anomalously-high values of the Simpson to small sample sizes.

Zhang and Srihari (2003) survey binary dissimilarity coefficients in the context of character recognition; some of their results are instructive. In their look at nine familiar coefficients they define relative discriminatory power in terms of entropy, itself proportional to the variance of dissimilarities in multivariate space. They consider coefficients with a wide range of values to have potentially greater discriminatory power, finding that the Russell and Rao coefficient had the poorest discriminatory power and the Jaccard and related coefficients moderate power. Highest discriminatory power was shared by the correlation coefficient, Yule and Rogers and Tanimoto. In the study by Winrow and Sutton, the similarity coefficients had a narrow range of values compared with the Raup-Crick and Simpson.

### **8.7 Illustrative Example**

Both R-mode and Q-mode analysis were performed on presence-absence data collected from five outcrops of the Middle Devonian Hamilton Group in New York State, although only ordinations of taxa will be shown here for reasons of space. Lithology of the interval sampled included thin limestones, mudstones, silty mudstones, and calcareous siltstones. The data matrix comprises 43 samples and 32 taxa identified to species when possible (Hohn 1975).

Cluster analysis, principal components analysis, and principal coordinates analysis were carried out; results of principal coordinates analysis best illustrate similarities and differences among the coefficients used. The statistical package PAST (Hammer et al. 2001) offers a wide range of multivariate methods and coefficients including similarity, matching, and association. I looked at results for the Phi (Correlation Coefficient in PAST) and Raup-Crick coefficients to observe their near-equivalence; the Jaccard as representative of similarity coefficients; Simpson's coefficient as an unusual asymmetric coefficient used with some frequency; and to represent matching coefficients, the Hamming normalized to lie between 0 and 1:

$$C_{\text{Hamming}} = (b+c)/N$$

In signal processing and information theory, Richard Hamming is known for the Hamming distance and Hamming window in addition to other contributions. Note the simple relationship between the normalized Hamming and Simple Matching coefficients:

$$C_{\text{SimpleMatching}} = 1 - C_{\text{Hamming}}$$

Looking at plots of the first two principal coordinate axes (Figs. 8.1, 8.2, 8.3, 8.4 and 8.5), one might be struck by how similar they appear. However, most of us would probably consider the results from the Hamming (Simple Matching) coefficient in Fig. 8.1 difficult to interpret. The Jaccard is a great improvement (Fig. 8.2) as indeed is Simpson's coefficient (Fig. 8.3). The Phi coefficient of association and the Raup-Crick probabilistic measure give almost identical results (Figs. 8.4 and 8.5).

The biggest differences among the five plots are positions of the most abundant taxa such as the brachiopod *Tropidoleptus* and bivalve *Paleoneilo*. They occur in a large proportion of samples (Table 8.1) and provide little discriminatory power among assemblages. Relatively abundant taxa score highly in an absolute sense on the second principal coordinate axis (vertical axis) for the Hamming and Jaccard coefficients, less so for the Raup-Crick and Phi. There is a clear correlation between principal coordinate scores on this axis with taxon count for the Hamming and Jaccard coefficients (Fig. 8.6). This observation agrees with the findings of Jackson et al. (1989).

**Fig. 8.1** Principal coordinates analysis with Hamming coefficient of dissimilarity

**Fig. 8.2** Principal coordinates analysis with Jaccard coefficient

Based on percent of variance explained by the first three principal coordinate axes (Table 8.2), the Hamming coefficient would appear to perform best. Similar results were obtained from nonmetric multidimensional scaling of each coefficient matrix (Table 8.3). But we already know that a portion of the variance correlates with taxon abundance. This observation suggests that selecting a coefficient based on variance explained has limited value if the coefficient measures the wrong thing.

Q-mode analyses showed similar correlation of abundance with principal coordinate scores calculated from Hamming and Jaccard coefficients. The relationship is not as strong because no sample contained more than 26% of the taxa, whereas *Tropidoleptus* in the R-mode analysis occurred in 84% of samples.

Note that the Raup-Crick procedure does not yield a binary coefficient in the sense of all the others; rather, it arrives through Monte Carlo sampling at a measure similar to the correlation coefficient. Practitioners use the Raup-Crick measure in the same way as any of the other binary coefficients for cluster analysis and ordination. However, there is no guarantee that it has strictly metric properties, and indeed, principal coordinates analysis with the Raup-Crick statistic yielded a large proportion of negative eigenvalues.

**Fig. 8.3** Principal coordinates analysis with Simpson's coefficient

### **8.8 Discussion and Conclusions**

Studies published over the past decade give a taste of the application of binary coefficients of all types.

Brayard et al. (2007) used distances 1 – *S*<sub>Dice</sub> in Q-mode cluster analysis and ordination of Early Triassic ammonoid faunas, citing the double weight given to mutual presences, thus downweighting the influence of unique species occurrences and not giving any weight to mutual absences. They used the square root of the dissimilarity matrix so that the resulting distances would be metric and Euclidean (Gower and Legendre 1986).

In studies of faunal lists of bivalves from around the globe, Schmachtenberg (2008) compared four coefficients: Jaccard, Simpson, Raup-Crick, and a measure of endemism. He did not do any cluster analyses or ordinations, but rather regressed value of each coefficient on geographic distance. The Simpson, Raup-Crick, and natural log of the Jaccard coefficient performed almost equally well.

**Fig. 8.4** Principal coordinates analysis with Raup-Crick Coefficient

Huang et al. (2012) considered the performance of five coefficients (Jaccard, Dice, Cosine, Yule's Y, and Raup-Crick) in cluster analysis and nonmetric multidimensional scaling of Silurian brachiopod assemblages representing time after the Late Ordovician extinction events. They preferred the Raup-Crick coefficient for ordination because it yielded the lowest stress value. On the other hand, they primarily used Yule's Y in their cluster analyses, where:

$$C_{\text{YuleY}} = (\sqrt{ad} - \sqrt{bc})/(\sqrt{ad} + \sqrt{bc})$$
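Yule's Y is closely tied to Yule's Q, (*ad* – *bc*)/(*ad* + *bc*). Algebraically Q = 2Y/(1 + Y<sup>2</sup>), so the two are related monotonically. The sketch below checks this with hypothetical counts. The function names are illustrative only.

```python
import math


def yule_q(a: int, b: int, c: int, d: int) -> float:
    return (a * d - b * c) / (a * d + b * c)


def yule_y(a: int, b: int, c: int, d: int) -> float:
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)


counts = (6, 3, 1, 10)  # hypothetical a, b, c, d
q = yule_q(*counts)
y = yule_y(*counts)
print(f"Q = {q:.4f}, Y = {y:.4f}")

# Q is a monotone function of Y on the interval [-1, 1].
print(abs(q - 2 * y / (1 + y * y)) < 1e-12)  # True
```

Both functions divide by zero when *ad* and *bc* are both zero, a case that real data with rare taxa can produce and that a production implementation would need to handle.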

In a paleoecological and paleogeographical analysis of Late Ordovician cephalopods, Kröger and Ebbestad (2013) used the Raup-Crick and Bray and Curtis coefficients in cluster analysis of assemblages and concluded that the Raup-Crick dissimilarity index gave better-resolved groups.

Balseiro (2016) studied changes in composition and diversity of brachiopods and bivalves in western Argentina during the main Carboniferous glacial event. The author observed few differences among results from several types of ordination and choice of coefficients, including the modified Forbes coefficient of Alroy (2015b) and Bray and Curtis dissimilarity.

Many reviewers of binary coefficients note the controversy that surrounds the question whether mutual absences should be included in a coefficient. Some authors categorically reject coefficients that include *d* (e.g. Shi 1993). Reasons cited include: mutual absences do not contain information; we can never know the total number of taxa *N* in a paleogeographic study; we can inflate differences through inappropriate inclusion of taxa or samples; or sampling effort or success is uneven and therefore the appropriate *N* is unknown. There are counterarguments for each one of these objections and the user is left to decide for himself or herself. For instance, knowledge of mutual absences is necessary to evaluate the probability of an observed pattern of occurrences, and therefore it conveys information. While we cannot know *N* exactly, we have ways to assess completeness of sampling, and after all, any statistic is based on samples and *N* is no exception.

**Fig. 8.5** Principal coordinates analysis with correlation (Phi) coefficient

In contrast to the other objections to the use of mutual absences, uneven sampling among locations appears to be a real problem and the effect on even probabilistic measures of association is not well understood. Simpson's coefficient and the modified Forbes coefficient of Alroy (2015a, b) attempt to correct for this problem. Neither coefficient conveys any probabilistic information. This is the price one pays when sampling is less than optimal. To draw strong conclusions, sampling methods are all-important.


### **8.9 Summary**


**Fig. 8.6** Bivariate plots of number of samples in which each of the 32 taxa occurred (horizontal axes) and scores on principal coordinate axes (vertical axes). Only scores on the second axis are shown for Hamming, Jaccard, Correlation coefficient (Phi), and Raup-Crick coefficients


**Table 8.2** Percent of total variance explained by each axis in principal coordinates analyses by coefficient

**Table 8.3** Variance along each axis and stress for nonmetric multidimensional scaling



obtained using more than one coefficient could help the practitioner partition out the least informative occurrences.


In conclusion, it remains a reasonable strategy to compare dendrograms and ordinations calculated with several coefficients. Care should be taken to select coefficients with different performance characteristics. Finally, the practitioner should understand the purpose of each coefficient.

**Acknowledgements** Gordon Baird helped extensively in the early stages of this project by sharing his knowledge of outcrop locations and updated correlations of the Hamilton Group. Thomas Kammer made valuable suggestions to an early draft of this paper.

### **References**



### **Chapter 9 Tracking Plurigaussian Simulations**

**M. Armstrong, A. Mondaini and S. Camargo**

**Abstract** The mathematical method called Plurigaussian Simulations was invented in France in the 1990s for simulating the internal architecture of oil reservoirs. It rapidly proved useful in other domains in the earth sciences: mining, hydrology and history matching. In this chapter we use complex dynamic networks first developed in statistical mechanics to track the diffusion of the method within academia, using citation data from Google Scholar. Since governments and funding agencies want to know whether ideas developed in research projects have a positive effect on the economy, we also studied how plurigaussian simulations diffused from academia to industry. The literature on innovation usually focusses on patents but as there were few on plurigaussian simulations, we needed criteria for deciding whether an innovation had been adopted by industry. Three criteria were identified:


The second criterion revealed how important master's level courses are in training geoscientists in the latest techniques. Their role in transferring knowledge to industry is undervalued in current procedures for evaluating university departments.

**Keywords** Complex dynamic networks ⋅ University-industry interaction ⋅ Technology diffusion ⋅ Google Scholar citations

M. Armstrong (✉) ⋅ S. Camargo

Escola de Matemática Aplicada, Fundação Getulio Vargas, Rio de Janeiro, Brazil

e-mail: margaret.armstrong@mines-paristech.fr; margaret.armstrong@fgv.br

M. Armstrong

MINES Paristech, PSL Research University, CERNA – Centre for Industrial Economy, i3, CNRS UMR 9217, Paris, France

A. Mondaini

Department of Physics, UERJ, Rio de Janeiro, Brazil

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_9

### **9.1 Introduction**

In August 2015 Fundação Getulio Vargas (FGV)<sup>1</sup> organized a 3-day seminar on applied research and invited Jane Tinkler from the London School of Economics to give a plenary lecture on how to assess the impact of research in the social sciences on policy decisions. She stressed the fact that it often takes 15–20 years to see the effects of academic research in the real world. Her talk inspired the lead author to ask how research in the geosciences diffuses within academia and from there, into industry. Why is it important to understand how ideas are adopted by industry? Because in the future, in addition to publishing in top journals, academics will probably need to demonstrate that their research is generating innovations to fuel national economies. For example, the Australian government has been funding a national survey since 2001 to collect data on the commercialization of the results of publicly funded research, especially their impact on intellectual property.

Since the pioneering work of Schumpeter in the 1940s, economists have agreed that a large component of modern economic growth has been driven by "innovation", that is, the arrival of new ideas. Nowadays, most papers on the relationship between scientific research and innovation use citation data to measure the production of new ideas in science and patent data to measure the creation of new potentially successful commercial ideas. Patents have become particularly important in this context for three reasons (Agrawal and Henderson 2002):


This approach has proved very fruitful in fields where the technology is evolving rapidly and where patents protect their inventors, for example, pharmacy and biotechnology, nanotechnologies, and wind and solar power generation. But it is not pertinent in sectors where patents are less common and where the transfer of new ideas from academia to industry follows different channels (Zellner 2003; Martin and Tang 2007; Moser 2012; Maietta 2015). Geosciences is one such domain.

<sup>1</sup> FGV is a private university and think tank located in Rio de Janeiro, that has internationally recognized research groups in economics, law, public administration, and management, and more recently an energy group and an applied maths department.

In order to discover how ideas diffuse within academia and from there into industry, we chose to focus on a specific new method (plurigaussian simulations) which was invented in France in the 1990s for simulating the internal architecture of oil reservoirs (Galli et al. 1994; Armstrong et al. 2011). It rapidly proved useful in other domains in the earth sciences: mining, hydrology and history matching. In the first part of this chapter, after collecting citation data from Google Scholar, we use complex dynamic networks first developed in statistical mechanics to track the diffusion of the method in the academic world. In the second half of the chapter we study how this method moved into industry.

The chapter is divided into five sections. The next one (Sect. 9.2) is a literature review on complex dynamic networks, especially citation networks. In Sect. 9.3 this technique is applied to our citation network for plurigaussian simulations. As only 9 out of the 550 citations were patents, these were not the vector in transferring the method into industry. In Sect. 9.4 we identify three key indicators showing how this innovation was incorporated in industry. Our conclusions follow in Sect. 9.5.

### **9.2 Review of Complex Networks**

Over the past 30 years the methods developed by physicists for studying networks in statistical mechanics have been adapted to analyzing other types of networks including the world-wide web (Broder et al. 2000; Albert et al. 1999, 2000), power grids (Watts and Strogatz 1998), telephone call grids (Abello et al. 1998) and airline timetables (Amaral et al. 2000). Newman (2001) and Barabasi et al. (2002) both studied citation networks in which the authors were the nodes in the network and a link was formed between two authors when they co-authored a paper. Newman (2001) studied four such collaboration networks:


Although the databases went back earlier, Newman limited his study to the window from 1995 to 1999 in order to obtain a good static photo of the conditions at that time. In contrast, Barabasi et al. (2002) studied the evolution over time of patterns of collaboration in two specific fields, mathematics and neuroscience, over the period from 1991 to 1998, using databases consisting of 70,975 different authors and 70,901 papers for mathematics and 209,293 authors and 210,750 papers for neuroscience.

By 2000, theoretical and empirical studies had uncovered three important results: firstly, most networks have the so-called small-world property which means that the average separation between nodes is rather small; secondly, real networks display a higher degree of clustering than expected for purely random networks and finally, the degree distribution follows a scale-free power-law form (Barabasi et al. 2002). Initially it had been expected that the Web would be a random network like those characterized by Erdos and Renyi (1959). In that case the probability of any two nodes being connected is constant, and most nodes have a degree (number of connections) that is close to the average and the degree distribution is exponential. Albert et al. (1999) showed that the distribution for the Web is a power-law, which means that a few nodes are highly connected while the vast majority have a smaller degree than average.
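The contrast between a random network and a scale-free one can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the function names, the network size (2000 nodes) and the preferential-attachment growth rule are choices made here, not something taken from the studies cited above. It builds one graph in which every pair of nodes is linked with the same probability and one in which new nodes prefer to attach to already well-connected nodes, then compares their mean and maximum degrees.

```python
import random


def erdos_renyi(n, p, rng):
    """Random graph: every pair of nodes is linked with the same probability p."""
    degree = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                degree[i] += 1
                degree[j] += 1
    return degree


def preferential_attachment(n, m, rng):
    """Growing graph: each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    degree = [0] * n
    endpoints = []  # every edge contributes both of its endpoints
    # Start from a small fully connected core of m + 1 nodes.
    for i in range(m + 1):
        for j in range(i + 1, m + 1):
            degree[i] += 1
            degree[j] += 1
            endpoints += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            # Picking an edge endpoint uniformly picks a node by its degree.
            targets.add(rng.choice(endpoints))
        for t in targets:
            degree[new] += 1
            degree[t] += 1
            endpoints += [new, t]
    return degree


rng = random.Random(0)  # fixed seed so that the run is repeatable
n, m = 2000, 2          # illustrative sizes, chosen arbitrarily
er = erdos_renyi(n, 2 * m / (n - 1), rng)  # expected mean degree of about 2m
pa = preferential_attachment(n, m, rng)

for name, deg in (("random", er), ("preferential attachment", pa)):
    print(name, "mean degree:", sum(deg) / n, "max degree:", max(deg))
```

Both graphs have nearly the same mean degree, but in the random graph no node strays far from that mean, whereas the preferential-attachment graph develops a few highly connected hubs, which is the qualitative signature of a power-law degree distribution.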

By computing the statistics of the number of authors per paper, the number of papers per author and the number of collaborators per author in various fields, Newman (2001) confirmed that their distributions follow a power-law form. All the networks contain a giant component of scientists, any two of whom can be connected by a shortest path of intermediate collaborators.

### **9.3 Network Analysis of Google Citations of Plurigaussian Simulations**

The first step in our study consisted of collecting all the publications up to December 2015, found by Google Scholar for the term "Plurigaussian simulations". A total of 555 references were obtained. Google Scholar had ordered them from the most relevant to the least (as determined by its algorithm). They include journal articles, working papers, doctoral and master's theses, final year projects, patents and the two books on Plurigaussian Simulations together with chapters from the books which are sold separately by the publishers. These citations can be split into four groups:



*be to constrain geostatistical simulations by the model results, e.g., training maps for multipoint or plurigaussian methods*".

(4) Papers which do not mention plurigaussian simulations at all.

Of the original 555 references, 307 fell into the first category, 166 into either the second or third while 82 fell into the fourth category. The last group were eliminated from further study. For the 473 references in the first 3 categories, we noted the information listed in Table 9.1. Table 9.2 summarises the statistics of applications in the four main domains.



**Table 9.2** Results for the four main applied fields


### *9.3.1 Building a Citation Network*

In contrast to Newman (2001) and Barabasi et al. (2002), who built their networks by considering authors as nodes and linking those who had joint papers, we constructed the plurigaussian network by considering each publication as a node, with an edge between two of them when one publication cites the other, producing a directed network. Our network (Fig. 9.1) is displayed with different colours for the different fields of application: black for oil, mauve for mining, blue for water, red for history matching, green for agriculture, mustard for soil science and white for others. As expected, publications in the same field tend to be clustered together in the network.

**Fig. 9.1** The citation network for plurigaussian simulations, with different colours indicating the different fields of application: black for oil, mauve for mining, blue for water, red for history matching, green for agriculture, mustard for soil science and white for others. The size of the nodes is proportional to their rank according to PageRank and Betweenness centrality


**Table 9.3** Rank of publications according to centrality measures, namely PageRank and Betweenness

As the network is composed of about 500 publications, it is interesting to know which nodes are the most important, and centrality measures provide such answers based on the topology of the network. Here we used two measures: PageRank and Betweenness centrality. PageRank (Page et al. 1999) evaluates the importance of a node based on how many edges point to it, whereas Betweenness centrality (Freeman 1977) estimates how often a node lies on the paths between other pairs of vertices. Figure 9.1 shows the network of plurigaussian simulations when the node size is proportional to PageRank centrality (left panel) and Betweenness centrality (right panel). At first glance the figures look very similar, but there are differences in the importance of some of the nodes, as can be seen in Table 9.3, which lists the ten most important publications according to these two centrality measures.
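To make the two measures concrete, here is a minimal sketch on a toy directed citation network in which each publication is a node and each edge points from the citing paper to the cited one. Everything in it is an assumption made for illustration: the five paper labels are hypothetical, the damping factor of 0.85 is the commonly used default rather than a value reported in the chapter, and the betweenness routine follows Brandes' standard algorithm without normalisation. It is not the software used to produce Fig. 9.1 or Table 9.3.

```python
from collections import deque


def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power iteration for PageRank on a directed graph.
    Edges are (citing, cited) pairs, so rank flows from citing to cited."""
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is shared by all nodes.
        dangling = sum(rank[v] for v in nodes if not out[v])
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for w in out[v]:
                new[w] += damping * rank[v] / len(out[v])
        rank = new
    return rank


def betweenness(nodes, edges):
    """Brandes' algorithm: for each node, the number of shortest directed
    paths between other pairs that pass through it (unnormalised)."""
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    score = {v: 0.0 for v in nodes}
    for s in nodes:
        dist = {s: 0}
        paths = {v: 0 for v in nodes}
        paths[s] = 1
        preds = {v: [] for v in nodes}
        order = []
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in out[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in nodes}
        for w in reversed(order):  # accumulate dependencies backwards
            for v in preds[w]:
                delta[v] += paths[v] / paths[w] * (1.0 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Hypothetical toy network: labels are illustrative, not real publications.
nodes = ["founding", "oil_1", "mining_1", "oil_2", "review"]
edges = [("oil_1", "founding"), ("mining_1", "founding"),
         ("oil_2", "oil_1"), ("review", "oil_2")]

pr = pagerank(nodes, edges)
bc = betweenness(nodes, edges)
for v in nodes:
    print(v, round(pr[v], 3), bc[v])
```

In this toy network the founding paper has the highest PageRank because citations accumulate on it, yet its betweenness is zero because it cites nothing and so never lies between two other papers. The intermediate papers, which both cite and are cited, carry the betweenness. This is one way the two rankings can differ for the same network.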

### **9.4 Diffusion of the New Method into Industry**

In our analysis of the citation network we had been surprised to find so few patents (only 9 out of 550). Moreover, these only started in 2006 (i.e. 10 years after the invention of the method). This was because software could not be patented before then (see Appendix 9.1 for more detail). As patent data could not be used to determine when the method actually reached industry, we needed some other criteria. Based on Tijssen et al. (2009), we used the following:


It is important to distinguish between the two. Resource companies like Shell or Chevron, or Rio Tinto or Anglo-American are "end-users" whereas consultants and software vendors transfer the idea to end-users, so their business plans are quite different.

The citations came from four main applied fields<sup>2</sup> (oil, mining, water resources and history matching). Looking back at Table 9.2, very few papers in water resources had an author from a company or a consulting firm (only 9.2%) compared to 57.8% for oil, 35.2% for mining and 23.8% for history matching. This is probably because water is a public good that generates relatively small profits compared to the oil industry or mining.

### *9.4.1 Co-authors and Repeat Co-authors from Industry*

Although having a co-author from a company or a consulting group shows that the company is interested in the new technique, it does not tell us whether they have effectively adopted it. In some cases, co-authoring a paper with an academic is rather like "window-shopping". It allows the company to test a new method on a case-study but adopting it as a standard procedure requires more time and effort (Martin and Tang 2007). Table 9.4 lists the companies and consultants which had co-authored more than 1 paper together with the number of papers, for each type of application. In applications to oil, seven companies and consulting groups had co-authored two or more papers, compared to 11 which had contributed to only 1; similarly five mining companies had co-authored two or more papers, compared to 8 which contributed to only 1 paper. It would be interesting to know what happened to the 11 oil companies that only participated in 1 paper, and likewise for the 8 mining companies. Did they lose interest in the method after an initial test study? Or did they decide to train their personnel or to outsource studies to consultants?
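The distinction between single and repeat co-authoring amounts to a simple count per organisation. The sketch below shows one way to do that count. The records and company names are entirely hypothetical placeholders, not data from Table 9.4, and the threshold of two papers simply mirrors the "two or more papers" criterion used in the text.

```python
from collections import Counter

# Hypothetical records: one (paper, industry co-author organisation) pair each.
records = [
    ("paper_01", "Company A"),
    ("paper_02", "Company A"),
    ("paper_03", "Company B"),
    ("paper_04", "Consultant C"),
    ("paper_05", "Consultant C"),
    ("paper_06", "Consultant C"),
    ("paper_07", "Company D"),
]


def classify(records, threshold=2):
    """Split organisations into repeat co-authors (at least `threshold`
    distinct papers) and single co-authors (fewer than that)."""
    counts = Counter(org for _, org in set(records))  # ignore duplicate rows
    repeat = sorted(org for org, c in counts.items() if c >= threshold)
    single = sorted(org for org, c in counts.items() if c < threshold)
    return repeat, single


repeat, single = classify(records)
print("repeat co-authors:", repeat)
print("single co-authors:", single)
```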

<sup>2</sup> Among the other papers, some were theoretical; a few were applications to precision agriculture or soil science. Plurigaussian simulations were even used to map the soil layers in archeological sites in ancient Rome (Folle 2009; Raspa 2000).


### *9.4.2 Surveys of Academics and Consultants*

The last part of the study consisted of a survey to find out (a) which companies had started training their personnel by sending them to short courses or to postgraduate and masters courses, and (b) which were outsourcing studies. While there are clear limitations to what can be obtained from voluntary declarations because people tend to bias their answers and while our survey was far from exhaustive, the results give us some ideas about what has happened.

Three groups (the IFP at Rueil-Malmaison, the CG at Fontainebleau and Jeffrey Yarus and Rich Chambers, in the USA) ran extensive programs of short courses. Table 9.5 lists the short courses on truncated gaussian and plurigaussian simulations given by Christian Ravenne<sup>3</sup> and Brigitte Doligez, both of the IFP. The Centre de Géostatistique was also active in giving short courses, often as pre-conference courses or in-house for oil companies, and the consulting and software company, Geovariances, regularly gives a 5 day course on conditional simulations applied to mining and has a 3 day course on advanced geostatistics for reservoir characterization. Both have modules on plurigaussian simulations. From 2000 to 2006, Jeffrey Yarus and Rich Chambers gave 4–5 courses per year through the Nautilus Training Organization and two more per year in Abu Dhabi for Schlumberger. After joining Landmark, they continued giving courses in Houston and London each year.

Most postgraduate geostatistics courses have modules on simulation. Some students choose this topic for their project/thesis. The Ecole des Mines de Paris has been running a 9 month postgraduate geostatistics course called the CFSG<sup>4</sup> since 1980. The last 3 months are devoted to a personal project on a real case-study, usually provided by the company sponsoring the student. Similarly, final year undergraduates and masters students have carried out studies on plurigaussian simulations at the University of Chile, at Edith Cowan University (Western Australia), at the University of Adelaide (South Australia) and at the federal university UFRGS (Rio Grande do Sul, Brazil), to mention just a few. As most of these are confidential, Google Scholar cannot find them. Table 9.6 lists the titles of projects that involved plurigaussian simulations and were carried out at various universities. One interesting feature is the number of studies that used data from the South American mining companies, Codelco and Vale, which were absent from the list of "repeat co-authors".

<sup>3</sup> The list is available in his HDR thesis (Ravenne 2001). At the time he was Directeur Associé de Recherche at the IFP. He subsequently retired in 2008.

<sup>4</sup> CFSG = Cycle de Formation Spécialisée en Géostatistique.

**Table 9.5** Short Courses on the truncated gaussian method and on plurigaussian simulations by Christian Ravenne, who was a geologist at the IFPEN before his retirement, and more recently by Brigitte Doligez, who is also a geologist at the IFPEN

**Table 9.6** List of the titles of confidential reports on plurigaussian simulations by students at various universities

Lastly, the consulting arm of the IFPEN, Beicip-Franlab, kindly provided us with a list of the consulting projects involving plurigaussian simulations that they have carried out for clients (Table 9.7). The range of companies involved is striking. Almost all of them are national oil companies, many located in the Middle East.

Looking through these three tables, it is clear that the publications found by Google Scholar are really only the tip of the iceberg. Underneath, there are many unpublished dissertations and project reports by final year and masters level students which remain confidential, in contrast to PhD theses, which are usually available on the internet. Most of these final year and masters dissertations were carried out on company data by a student who had been given time off work to study. We believe that these studies are a key step in getting new methods into regular use in industry. This suggests that university assessments should take account of final year projects and master's level dissertations, which is not the case at present in most countries, because they are one of the key channels for transferring innovations into industry, at least as far as the earth sciences are concerned.

### **9.5 Conclusions and Perspectives for Future Work**

Plurigaussian simulations were developed in France in the mid-1990s for simulating the internal architecture of oil reservoirs in order to better predict oil and gas production. Although they were originally designed for the petroleum industry, they rapidly found applications in mining and hydrology and then for history matching in the oil industry. From France the technique diffused to other European countries, then to countries like the USA, Brazil and Chile.

This chapter uses complex dynamic networks to describe how the method diffused within the academic community. Citations found using Google Scholar for the term "plurigaussian simulations" were used to track its diffusion within academia. In contrast to most citation networks, where the nodes are the authors of papers and a link corresponds to co-authoring, in our network the papers themselves are the nodes, which are linked when one paper cites another.

Papers were split according to the domain of the application: oil, mining, water or history matching. As expected, we found that


To our surprise there were few patents (only 9 out of 550) and these only started in 2006 (i.e. 10 years after the initial discovery). It turned out that software could not be patented before then. Studies on innovation consider that the presence of an author from industry demonstrates that company's interest in the innovation under study. In the earth sciences, companies often co-author papers in order to test new methods on their own data.

One of the main contributions of our chapter is to identify this "*window*-*shopping effect*". We consider that co-authoring a single paper does not necessarily mean that the company has really adopted the method. More effort is required to absorb new methods. Instead, we postulate that co-authoring a second paper indicates a more serious interest: we call this "*repeat co*-*authoring*". We found that seven oil companies and consulting groups had co-authored two or more papers compared to 11 which had contributed to only 1; similarly five mining companies had co-authored two or more papers compared to 8 which contributed to only 1 paper. It was surprising not to see South American mining companies such as Codelco and Vale among the mining companies. We were also curious to find out whether the 11 oil companies and 8 mining companies that only co-authored 1 paper had lost interest in the method, had trained staff to carry out studies for them or had commissioned consultants to do them.

To find out what happened we carried out a survey of academics, end-users in companies and consultants. Clearly there are limitations to what can be obtained from voluntary declarations; people may bias their answers but the survey gave us some ideas about what had happened. The key results were:


### *9.5.1 What Lessons Can Be Learned from the Study for Policy-Makers*

Firstly, while studies on patents can be very effective for assessing the industrial impact of new discoveries in some fields, they would have completely missed the target in this field, for two reasons: it was not possible to patent software developments until after 2005, and even after that date, the new developments in mining software for these simulations were not patented.

Citation networks proved to be more effective than patents in this field. They allowed us to track the development of plurigaussian simulations within four different but inter-related academic domains and to industrial partners who publish in journals with academics. But even citations do not really allow us to get past the superficial "window-shopping" aspect of publications. Studying "repeat co-authoring" provides more in-depth insights; surveys of users give a clearer picture of whether companies are actually implementing new methods.

As Martin and Tang (2007) noted, firms and other users need to expend considerable effort to exploit scientific knowledge. In order to develop the in-house capability to carry out plurigaussian simulations, they need to acquire software and to train personnel. This study highlights the importance of postgraduate training and masters' theses in transferring know-how and implicit knowledge to industry. The role of these courses in technology transfer to industry is undervalued in the current procedures for evaluating university departments.

**Acknowledgements** Many people have kindly answered our questions and provided feedback at seminars. The main ones are listed here in alphabetical order: Denis Allard, Hélène Beucher, Romain Bizet, Rich Chambers, Jean-Marc Chautru, Joao Felipe Costa, Sebastien Delamarre, Brigitte Doligez, Peter Dowd, Xavier Emery, Daiane Folle, Gaëlle Le Loc'h, Ute Mueller, Dean Oliver, Guiseppe Raspa, Christian Ravenne, Philippe Renard, Denis Schiozer, Olinto de Souza Gomes and Jeffrey Yarus.

### **Appendix 9.1**

When analyzing the citations we had been surprised that only 9 of the documents were patents (Table 9.8). Moreover, these were all in the petroleum sector (either oil or history matching) and were lodged more than a decade after the initial discovery of the method (1996). Why so few patents and why so late? One possible reason is that after the method had been published, the tacit knowledge was partly encapsulated in software and partly in knowing how to use the software. Firms of consultants who had acquired this knowledge made a living carrying out case-studies for oil companies. Research universities, which are also repositories of this knowledge, transmit it to students via postgraduate diploma courses, or masters or doctoral programs.

But the main reason for the lack of patents before 2006 (10 years after the initial discovery) is that oil companies and service providers only started patenting programs then. Until the late 1960s, computer programs were not considered patentable (Bender 1968); they could only be protected by copyright law. By the 1990s, it had become critical in the information economy to be able to protect IP on computer programs (Thurlow 1997). Ten years later the problem had been resolved. Merges (2007) commented: *the legal system is integrating software into the fabric of patent law, and software firms are integrating patents into the competitive fabric of the industry*. This explains why patents only started to appear so late.


**Table 9.8** Patents

### **References**


Thurlow LC (1997) Needed: a new system of intellectual property rights. Harv Bus Rev 94–103



### **Chapter 10 Mathematical Geosciences: Local Singularity Analysis of Nonlinear Earth Processes and Extreme Geo-Events**

**Qiuming Cheng**

**Abstract** In the first part of the chapter, the status of the discipline of mathematical geosciences (MG) is reviewed and a new definition of MG as an interdisciplinary field of science is suggested. Similar to other disciplines such as geochemistry and geophysics, mathematical geosciences or geomathematics is the science of studying mathematical properties and processes of the Earth (and other planets) with prediction of its resources and changing environments. In the second part of the chapter, original research results are presented. The new concepts of fractal density and local singularity are introduced. In the context of fractal density and singularity, a new power-law model is proposed to associate differential stress with depth increments at the phase transition zone in the Earth's lithosphere. A case study demonstrates the application of local singularity analysis for modeling the clustering in the frequency-depth distribution of earthquakes from the Pacific subduction zones. Datasets of earthquakes with magnitudes of at least 3 were selected from the Ring of Fire, the subduction zones of the Pacific plates. The results show that datasets from the Pacific subduction zones, except those from the northeastern zones, depict a pronounced frequency-depth cluster around the Moho. Further, it is demonstrated that the clusters of frequency-depth distributions of earthquakes in the colder and older southwestern boundaries of the Pacific plates generally depict stronger singularity than those obtained from the earthquakes in their hotter and younger eastern boundaries.

### **10.1 Introduction**

When this handbook is published, the International Association for Mathematical Geosciences (IAMG) is celebrating its 50th anniversary. Mathematical geosciences as a scientific discipline has become mature after half a century of development since the IAMG was established in 1968 at the 23rd International Geological Congress (IGC) in Prague. It had grown from mathematical geology to mathematical geosciences by the time its name was changed at the 33rd IGC held in Oslo in 2008. Not only has the subject been accepted widely within the geoscience community but the association has also been recognized for its reputation and significant influence on the earth sciences in general. IAMG has affiliations with several major geoscience organizations including the International Union of Geological Sciences (IUGS), the International Statistical Institute (ISI), and the International Union of Geodesy and Geophysics (IUGG). Diverse earth science topics have been published in IAMG conference proceedings and the IAMG journals (*Mathematical Geosciences*, *Computers & Geosciences* and *Natural Resources Research*). However, we have to realize that as a relatively young discipline, MG still has not been very widely accepted and is often ignored by mainstream geoscientists. While several definitions and terminologies were proposed to describe mathematical geology, there have been few attempts to define mathematical geosciences. For example, mathematical geosciences have often simply been referred to as applications of mathematical and statistical methods for the analysis of geological (earth science) data and the development of quantitative predictive models (Howarth 2017). The mission of the IAMG as shown on the IAMG website was defined as promoting the development and application of mathematics, statistics and informatics in the geosciences. Whether MG should be defined as a formal discipline of science or simply as applications of mathematics in the geosciences is a fundamental question with critical impact on the development of the subject. In this chapter, I will review the status of the discipline and suggest a new definition for MG, followed by examples to demonstrate what contributions MG has made to earth science and what the current developments in the field are. For the first part I will elaborate on MG on the basis of a literature review, and for the second on my own research in nonlinear MG as an example of a new field of MG.

Q. Cheng (✉)

State Key Lab of Geological Processes and Mineral Resources, China University of Geosciences, Beijing 100083, China e-mail: Qiuming.cheng@iugs.org

<sup>©</sup> The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_10

### **10.2 What Is Mathematical Geosciences or Geomathematics?**

One of the original definitions of mathematical geology was given by Vistelius (1962) and used in the name of the association, the International Association for Mathematical Geology (IAMG), when it was first established in 1968. Geostatistics is one of the successful fields of IAMG, which originally was developed by MG scientists within the IAMG community. It has been used not only in the geosciences but later in many other fields of science as well. Geostatistics focuses on the application of statistical methods in the earth sciences (e.g. Merriam 1970; McCammon 1975a, b) and still appears to be used by many in that sense. The term geomathematics was also used by several authors including Agterberg, who used the term as the title of his two books (Agterberg 1974, 2014). After the name of the association was changed from mathematical geology to mathematical geosciences in 2008, the term mathematical geosciences appeared more often in the literature of the IAMG and also in the titles of conferences, as well as in the name of its journal *Mathematical Geosciences*. While serving as president of IAMG (2012–2016), the author of the current chapter devoted much effort to promoting the discipline of mathematical geosciences. Several notes on this were published in the President's Forums in IAMG newsletters (Issues 76–79). The distinction between mathematical geology and mathematical geosciences is not simply in terminology but also in the scope of the discipline. While mathematical geology refers to a branch of geology, mathematical geosciences must be a subdiscipline of the geosciences, which include geology as one of their subfields. Other relevant subjects covered in the geosciences include but are not limited to geochemistry, geophysics, geobiology, and hydrology. Mathematical geosciences should be a discipline parallel to other subdisciplines in the geosciences such as geochemistry, geophysics and geobiology rather than a branch of geology.
In the author's personal view this distinction is critical for the development of the discipline. Under the concept of mathematical geology, the subject is limited to the application of mathematics in geology, but as mathematical geosciences, just like geochemistry and geophysics, it serves the whole of earth science. So, what should be the definition of mathematical geosciences or geomathematics, and what roles should mathematical geosciences play in the family of geosciences? Here I will briefly elaborate on these questions and introduce several major contributions of MG to earth science. In order to provide a proper definition of mathematical geosciences, we should look at the definitions of other relevant disciplines such as geochemistry, geophysics and geobiology:

• *Geophysics as a science of "the study of the earth's physical properties and of*


The definitions of the preceding relevant disciplines share the common concept of an interdisciplinary geoscience field. A similar definition was proposed by the author in 2014 with consultation of the IAMG Executive Committee Members and published in the President's Forum of IAMG newsletter (Issue No. 79).

• *As an interdisciplinary field merging mathematics, computer science and geosciences, MG is the science of studying mathematical properties and processes of the Earth (and other planets) with prediction and assessment of its resources and environments*

The key question arising from this definition is: what are the mathematical properties and processes of the Earth, and what prediction and assessment of its resources and environments must mathematical geoscience deal with in order to integrate with other geoscience subdisciplines? As in other interdisciplinary fields such as geochemistry and geophysics, mathematical subjects such as geometry, calculus, functional analysis, morphology, probability and mathematical statistics provide essential theory and methods for the quantitative study of the Earth, ranging from the geometry and dynamics of the Earth and the uncertainties of measurements and observations to the prediction of Earth events.

### **10.3 What Contributions Has MG Made to the Geosciences?**

There are many examples demonstrating that MG has made indispensable contributions to the geosciences. To name just a few:

• the mathematical model of the Earth's shape (e.g. the Clarke ellipsoid and the Hayford ellipsoid), which serves as the foundation of geodesy, navigation systems (e.g. GPS), remote sensing technology (RS), geographical information systems (GIS) and the fast growing field of geomatics;

• the mathematical model of mantle convection and models for plate motions (McKenzie and Parker 1967), which serve as the foundation of plate tectonics, the most notable development of earth science in the last century;

• mathematical symmetry and symmetry operations as principles of crystallography and optical mineralogy (e.g. in 1830, Hessel proved the existence of the 32 groups of crystal symmetry), which constitute a foundation of solid earth science;

• the mathematical topological model as the foundation of geographical information systems (e.g. as the basis of spatial data modeling in ArcGIS), one of the most useful technologies in geoscience;

• mathematical and statistical theories providing foundations for describing the spatial distribution and correlation of elements, and uncertainty and error bars, in geochemistry including isotope geochemistry and geochronology, as are also used for the geological time scale;

• mathematical modeling and uncertainty of prediction of climate change, a pressing issue of the geosciences;

• probability theory and stochastic models for prediction of energy and mineral resources, highly demanded by many nations for economic and societal development;

• geo-complexity theory such as fractals, multifractals, chaos and self-organized criticality for modeling and predicting singular events and extreme phenomena;

• information extraction (big data mining, machine learning, geo-intelligence) in the geosciences.

As the International Association for Mathematical Geosciences, IAMG has earned its reputation by promoting and fostering its members to make contributions to science. Original and significant studies have been published in IAMG journals, books and conference proceedings. However, a large amount of work is documented elsewhere, in publications which cover almost every mathematical subject and aspect of the geosciences, ranging from statistical data analysis, geometrical modeling and the simulation of dynamics and processes to the prediction and assessment of the Earth system. MG theories and methods have been applied not only in tackling conventional solid earth issues such as the assessment of mineral and energy resources, but also in other fields including hydrology, climate change, water resources, alternative energy resources and environmental issues. While the importance of MG in the geosciences has been increasingly demonstrated, the discipline of MG has not yet been fully recognized and is, to some extent, overlooked. Hardly any highly qualified personnel (HQP) are hired in academic institutions or industry with the job title Mathematical Geoscientist or Geomathematician. As a matter of fact, most of our IAMG members are employed with job titles such as geologist, geophysicist, geochemist, geodesist, computer scientist, mathematician or geoinformatics specialist instead of MG or GM. University students who are talented in mathematics and geosciences and want to pursue mathematical geoscience have to enroll in geophysics or other fields simply because MG does not exist as such in university programs, at least in most of the programs in developed nations. There are very few interdisciplinary university programs, except actuarial science, mathematical physics and mathematics for business, which have mathematics as an integral part of their subject. A common misconception is that learning mathematics can only lead to two kinds of jobs, pure mathematician or mathematics teacher, or serve as a prerequisite for other careers in engineering, science or business. This might be one of the reasons why so few students choose mathematics-related subjects for their careers. Thus, the IAMG faces significant challenges in promoting MG as a discipline and in facilitating the training and education of future generations. This is the bottleneck that prevents the IAMG from growing further and becoming a more successful and influential association.

The International Year of Mathematics of Planet Earth (MPE), celebrated in 2013, generated much-needed publicity for mathematics in geoscience. Mathematics courses are offered at every level from primary school to high school to university. Earth science is also a common choice of topic in essays by students. Integrating mathematics and earth science subjects should provide suitable and interesting topics for students' math or science projects. The mathematical and geoinformatical techniques learned by students early on are already powerful tools for exploring the Earth. An excellent example is the work, headlined in the media, published by a high school student, Alice R. Zhai, who analyzed 73 tropical cyclones that made landfall in the US and used multivariate regression to examine the dependence of hurricane economic loss on maximum wind speed and storm size. This study (Zhai and Jiang 2014) not only proposes a new model by which hurricane damage might be predicted but also provides new evidence showing the area-density power-law property of extreme events which, as will be introduced in the remainder of this chapter, has deep origins in nonlinear dynamics.

The development of modern information technology enables everyone to easily retrieve big data to support their studies via the internet and web services in a cloud environment. Accessing and processing huge amounts of data is no longer reserved for paid professionals. More and more specialized software packages, multimedia teaching and training materials, and online courses available in the public domain, together with Twitter, Facebook and YouTube, provide new ways of self-learning. Online communication, discussion and consultation through the internet, in and out of the classroom, have become common for students. These developments should encourage middle school, high school and university students to develop their curiosity in, passion for, and dedication to mathematical geosciences.

### **10.4 Frontiers of Earth Science and Opportunities of MG**

IAMG has been rapidly expanding its scope from traditional geostatistics or statistical geology to a more comprehensive interdisciplinary science that studies the properties and processes of the Earth mathematically, including prediction and assessment of its resources and environments. What are the current trends of MG and how are they associated with the Earth Science frontiers? It is impossible to create an accurate list of frontiers for MG. Of course, several previous publications by IAMG members have discussed past, current and future trends for the IAMG (Agterberg 2003). Here I will just share some thoughts based on my personal observations of several recent events and activities. Several international organizations have published white papers that review prospective trends of scientific research within their fields and set out strategic plans for the next 5–10 years; for example, the International Council for Science (ICSU) published its strategic research agenda for Future Earth 2025 Vision (http://www.futureearth.org/sites/default/files/future-earth\_10-year-vision\_web.pdf); the International Union of Geological Sciences (IUGS), jointly with UNESCO, offers the International Geological Correlation Program (IGCP) in addition to various other big science programs and new initiatives such as Resourcing Future Generations (RFG), an international collaborative program (http://iugs.org/uploads/RFG.pdf); the US National Science Foundation (NSF) has published a strategic plan for 2014–2018 (https://www.nsf.gov/publications/pub\_summ.jsp?ods\_key=nsf14043); the American Geophysical Union (AGU) produced a scientific trends report (https://about.agu.org/trends-earth-space-science/); the American Natural Science Foundation published its strategic plan for tectonics (https://docgo.net/national-science-foundationnsf-strategic-plan-fy-2006-2011-nsf-06-48); and a white paper resulting from NSF-sponsored workshops on "mathematics in geosciences" was published by a group of geoscientists in 2012 (https://cpb-us-e1.wpmucdn.com/sites.northwestern.edu/dist/8/1676/files/2017/10/agenda-xwphux.pdf), just to name a few.
Relevant publications resulting from international conferences such as the International Geological Congress (IGC) and the meetings of AGU, EGU and GSA, as well as special articles in journals such as Nature and Science, have also been concerned with these issues. The following summary of key topics can be extracted from the preceding sources of information to reflect current trends and frontiers of the earth sciences. These key topics include but are not limited to data science, data analysis, big data and geo-intelligence, computation, inter-/multi-/cross-/transdisciplinary science, integrated models, uncertainty relative to observations and predictions, properties and dynamics of the planet, climate change, disruptive processes such as earthquakes and storms, and special studies of the Arctic, Antarctic and Tibetan Plateau. The fundamental issues are understanding Earth and environmental systems and their interactions with human activities, and developing reliable monitoring systems, models, and information technologies for predictions and early warnings of large-scale and rapid change. The current challenges facing earth scientists are understanding and modeling the geo-complexity of the Earth and environmental systems with their interactions, the chaotic nature and predictability of geo-processes, Earth singularity and human mitigation of and adaptation to extreme events, plus observation and monitoring of multiple-scale mixing nonlinear processes.
Although most organizations neither recognize nor explicitly mention this, the majority of these frontiers are fundamentally related to MG. A long period of incremental advances of new mathematical theories and models in conjunction with modern technologies for solving these earth science problems may lead to creative leaps of innovation. MG has huge challenges and responsibilities facing the earth science frontiers. MG scientists are indeed at the frontier of earth science tackling fundamental problems of the Earth as can be evidenced by the recent advancements reflected in the topics of plenary presentations at IAMG conferences and in the best papers published in IAMG journals; for example, on multi-point geostatistics—a new field of spatial-temporal modeling (Mariethoz and Caers 2014); compositional data analysis—a new way to explore the composites of the Earth (Pawlowsky-Glahn et al. 2015); singularity analysis and singularity physics—new theory and methods of studying geodynamics and geo-complexity (Cheng 2007, 2017a; Agterberg 2017); big data visual analytics for exploratory data analysis; semantic web technology for geoinformation; uncertainty in ecosystem mapping by remote sensing; integrating structural geological data into inverse modeling frameworks; stationary and isotropic vector random fields on spheres; and mathematical morphology modeling, just to name a few.

### **10.5 Fractal Density and Singularity Analysis of Nonlinear Geo-Processes and Extreme Geo-Events**

For the past several decades nonlinear theory and geocomplexity marked an era of new geoscience that deals with nonlinear processes and extreme phenomena which occurred in the evolution of earth systems. Irregular geometry was not popularized in the past until the term "fractal" was coined by Mandelbrot in the 1970s. Fractal geometry rapidly became a new field of mathematics dealing with roughness and irregularity of geometries. For example, fractals have been used for modeling complex and self-similar patterns generated by nonlinear processes (Mandelbrot 1972; Feder 1988). The concept of fractals and fractal dimension was further extended to multifractals involving self-similar measures defined on support which can be fractal itself (Mandelbrot 1972; Meakin 1987; Schertzer and Lovejoy 1987). Multifractal measures have been further extended to fractal density in local singularity analysis (Cheng 1999a, 2001). In the following sections the concept of fractal density will be introduced and followed by discussion and application of new methods for fractal differential operation and fractal integration (Cheng 2017a).

### *10.5.1 Fractal Density*

Since the principle of density was discovered by the Greek scientist Archimedes approximately 2000 years ago, the well-known physical concept of density has become a fundamental property of mass or energy with a variety of applications. The density, or volumetric mass density, of a substance is its mass per unit volume. Density thus is a scale-independent property of material or energy treated as representing a fundamental physical parameter and variable in many physical models with applications in nearly all fields of study, ranging from physics to engineering, economics and the social sciences. Density often is characterized as unit of mass over volume (e.g., g/cm<sup>3</sup>, kg/m<sup>3</sup>) or energy over volume (J/cm<sup>3</sup>, w/L<sup>3</sup>). For example, the density of pure gold is 19.32 g/cm<sup>3</sup>, which is approximately 19 times as much as for an equal volume of water. The density of quartz is 2.65 g/cm<sup>3</sup>, which is much less than the density of gold. Therefore, the density of gold-mineralized quartz veins in hydrothermal mineral deposits is variable depending upon the concentration and distribution of gold in the quartz veins. Similarly, continental crust, which consists mostly of granitic rock, has a density of about 2.7 g/cm<sup>3</sup> and the Earth's mantle of ultramafic rock has a density of about 3.3 g/cm<sup>3</sup>. The density of seawater varies with temperature and salinity of the water. Although the density of seawater varies at different points in the ocean, a good estimate of its density at the ocean's surface is 1025 kg/m<sup>3</sup> or 1.025 g/cm<sup>3</sup>. Density of air is a temperature and pressure dependent parameter. For given temperature and pressure the density of air is independent of the volume of air. For a pure substance the density is independent of the volume of substance. However, for a heterogeneous substance density usually assumes different values depending upon purity and packaging.
For example, rocks consisting of minerals with different densities have variable densities depending upon the proportions of the minerals. For a quartz vein of pure SiO2 the density of the vein should be equal to the density of quartz, 2.65 g/cm<sup>3</sup>. However, if the quartz vein involves gold mineralization, then the density of the vein will differ from that of pure quartz, depending on how the gold is distributed in the vein. At a location of higher concentration, where a cluster of gold occurs in the quartz vein, the density of the vein is higher than that of pure quartz. From a fractal point of view, the structure of these types of gold distribution can be very irregular and then has to be described by a non-integer or fractal dimension. Accordingly, the value of "volume" of the substance is lost. Instead, the size of a fractal is measurable only if it is measured in fractal dimensional space or as Hausdorff measure (Cheng 2017a). This means that the ratio of mass over volume does not converge, and the density does not exist according to the ordinary density definition. In the following section it will be demonstrated that the concept of ordinary density of a substance is only valid for substances with regular or ordinary structure. For substances packaged in a fractal manner, a new form of density is needed, and the concept of ordinary density has to be generalized to a form capable of quantifying the density of complex objects. It will also be demonstrated that the end products of many types of singular processes possess fractal mass density or energy density. The concepts of fractal density and local singularity analysis have been utilized in several dynamic models involving extreme processes (Cheng 2012, 2016, 2017b; Cheng and Agterberg 2009; Cheng and Sun 2017).

### *10.5.2 Density-Scale Power-Law Model and Singularity*

According to the concept of ordinary density, the mass density of an object (ρ) can be calculated by the following equation:

$$
\rho = \frac{m(\nu)}{\nu},\tag{10.1}
$$

where *m*(v) represents the mass contained in a volume (v) and *ρ* is the average density of an object. If the object is homogeneous, then the density calculated in Eq. (10.1) is independent of volume. The unit of the density is determined by the ratio of the mass and volume; for example, g/cm<sup>3</sup>. However, if the object has heterogeneous properties, the density may vary from place to place and the average density in Eq. (10.1) varies with the size of v. A localized density must then be calculated using the derivative of the mass over volume:

$$\rho = \frac{dm(\nu)}{d\nu} = \lim_{\nu \to 0} \frac{m(\nu)}{\nu}. \tag{10.2}$$

The density in Eq. (10.2) exists only if the limit converges when the volume becomes infinitesimal. If the limit does not converge, then the density doesn't exist. As a generalization of Eq. (10.2), the following new Eq. (10.3) was introduced (Cheng 1999b, 2001) in which there exists a parameter α (with positive value) so that the limit converges:

$$\rho_{\alpha} = \lim_{\nu \to 0} \frac{m(\nu)}{\nu^{\alpha/3}}.\tag{10.3}$$

The value of ρα can be considered as a generalized density because the ordinary density defined in Eq. (10.2) becomes a special case of Eq. (10.3) when α = 3, the normal dimension of volume. This new density was named fractal density since it is defined as mass or energy per unit of "fractal set" (Cheng 1999b, 2001). The fractal density defined in Eq. (10.3) has as its unit the ratio of mass to a fractal set of α dimensions; for example, g/cm<sup>α</sup> or kg/m<sup>α</sup>. Similarly, the units of fractal energy density can be J/cm<sup>α</sup> or w/L<sup>α</sup>. Combining Eqs. (10.2) and (10.3) yields the following relationship between ordinary density and fractal density:

$$
\rho(\nu) = \rho_{\alpha} \nu^{-\left[1 - \alpha/3\right]}.\tag{10.4}
$$
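
The step leading to Eq. (10.4) can be written out explicitly. The following is a sketch, assuming that for sufficiently small v the mass follows the scaling m(v) ≈ ρα v<sup>α/3</sup> implied by the convergence of the limit in Eq. (10.3), and that ρ(v) denotes the average density m(v)/v of Eq. (10.1):

$$
\rho(\nu) = \frac{m(\nu)}{\nu} \approx \frac{\rho_{\alpha}\, \nu^{\alpha/3}}{\nu} = \rho_{\alpha}\, \nu^{\alpha/3 - 1} = \rho_{\alpha}\, \nu^{-\left[1 - \alpha/3\right]}.
$$

For α = 3 the exponent vanishes and ρ(v) = ρα is independent of volume; for α < 3 the average density grows without bound as v → 0, and for α > 3 it tends to zero.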

The notation of fractal density used in Eqs. (10.3) and (10.4) can be replaced by the following general model associating the fractal density with the ratio of mass and scale (ε—linear size of an E-dimensional set):

$$
\rho(\varepsilon) = \rho_{\alpha} \varepsilon^{-\left[E - \alpha\right]}. \tag{10.5}
$$

This power-law relation between the ordinary density and scale is determined by two parameters: the fractal density ρα which is independent of scale and the exponent–singularity index α (fractal dimension), or Δα = E − α; the latter is also known as the co-dimension of fractal density. The singularity index (Δα) measures the deviation of the fractal dimension from the dimension of normal density. These two parameters (ρα and Δα) can be estimated from observed data by measuring the intercept and slope of a straight line on the log-log plot of m against ε (Cheng 1999b, 2007).
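
As an illustration of this estimation step, the sketch below fits a straight line to log m against log ε by ordinary least squares. It assumes the mass-scale relation m(ε) = ρ(ε)ε<sup>E</sup> = ρα ε<sup>α</sup> implied by Eq. (10.5), so that the slope estimates α and the intercept estimates log ρα. The function name `loglog_fit`, the scales and the synthetic masses are hypothetical and chosen only for demonstration; real data would scatter around the line.

```python
import math

def loglog_fit(scales, values):
    """Least-squares line log(value) = intercept + slope * log(scale).

    Returns (slope, intercept).
    """
    xs = [math.log(s) for s in scales]
    ys = [math.log(v) for v in values]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

E = 2                                       # Euclidean dimension of the support (assumed)
scales = [1.0, 2.0, 4.0, 8.0, 16.0]         # linear sizes epsilon (hypothetical)
masses = [2.5 * s ** 1.6 for s in scales]   # synthetic m(eps) = rho_alpha * eps**alpha

alpha, log_rho_alpha = loglog_fit(scales, masses)
rho_alpha = math.exp(log_rho_alpha)         # fractal density (intercept)
delta_alpha = E - alpha                     # singularity index (co-dimension)

print(f"alpha = {alpha:.3f}, rho_alpha = {rho_alpha:.3f}, delta_alpha = {delta_alpha:.3f}")
```

Because the synthetic masses follow the power law exactly, the fit recovers the parameters used to generate them; with observed data the quality of the straight-line fit indicates how well the power-law model holds over the chosen range of scales.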

### *10.5.3 Multifractal Density*

If fractals refer to geometry with irregular shapes and self-similar geometrical properties, multifractals refer to self-similar measures defined on a support which can itself be fractal (Mandelbrot 1983). Multifractals are defined as spatially intertwined fractals with variable fractal dimensions (e.g., Mandelbrot 1972; Cheng 1997). According to the distribution of measures (similar to the mother functions of sets) the support can be grouped into subsets, each of which can be fractal with a specific fractal dimension. Accordingly, there are two types of multifractal measures, continuous and discrete: the former correspond to an infinite number of intertwined fractals with a continuous fractal dimension spectrum, whereas the latter correspond to a finite number of intertwined fractals with discrete fractal dimensions (Cheng 1997). Multifractal measures are self-similar measures with multiple scale singularities which can be characterized by the Hölder exponent (Mandelbrot 1989). In the multifractal paradigm the measure defined on a support can be expressed as

$$
\langle \mu(\varepsilon) \rangle \propto \varepsilon^{\alpha},\tag{10.6}
$$

where μ(ε) represents the measure defined on a set of linear scale ε, ∝ stands for "proportional to" as the cell size ε approaches zero, and α is the singularity index, also known as the Hölder exponent (Mandelbrot 1989). This power law usually holds in a statistical sense and is therefore written as the expectation ⟨·⟩. According to the distribution of α values, the entire support can be classified into subsets or fractals, each with a different singularity and accordingly a different fractal dimension. This is why it has been termed "multifractal". The distribution of singularity α in the mapped area can be described by the fractal dimension spectrum function *f*(α). The values of singularity and multifractal spectra can be estimated by several methods including box-counting and gliding-box based moment methods, and the wavelet method (Cheng 1999b). The singularity property has been commonly observed in geochemical and geophysical quantities (Cheng et al. 1994; Cheng 1999b, 2007). Since the common moment-based multifractal models are implemented according to partition functions of measures with the additive property, most literature about multifractals focuses on the power-law relations and self-similarity of multifractal measures, and few studies have emphasized either the physical meaning or the density property of the multifractal measure. A density-area fractal model was proposed (Cheng et al. 1994) to associate the concentration with the area of a multifractal measure as

$$A(\geq C) \propto C^{-\beta},\tag{10.7}$$

where the area (A) is a function of element concentration above the threshold C. The model has also been applied to characterize other types of "concentration" such as density of faults per area (Agterberg et al. 1996), density of mineral deposits per area (Cheng and Agterberg 1996), stream density per drainage area (Cheng et al. 2001), and digital number of remote sensing images (Cheng and Li 2002), just to name a few. Further utilizing the idea of C-A model locally, the following power law relation was introduced to associate the density of multifractal measures with scale (Cheng 1999b)

$$
\rho(\varepsilon, \mathbf{x}) = c(\mathbf{x}) \varepsilon^{-\left[E - \alpha(\mathbf{x})\right]},\tag{10.8}
$$

where E is the Euclidean dimension of the support (e.g., E = 1 for a line, 2 for an area and 3 for a volume), x indicates the location, and c(x) and α(x) are constants with respect to scale ε but vary with location. The values of α(x) and c(x) can be estimated from the values of ρ(ε, x) calculated for different sizes ε around the location x by means of least squares on a log-log plot. Both values can be mapped for visualization and interpretation. For convenience and without loss of generality, in the rest of the paper the notation x will be dropped from the formulation and the equation is assumed to hold locally. The singularity index α and constant c have the following properties (Cheng 1999b): if α = E, then ρ(ε) is constant, independent of the vicinity (scale) size ε; if α < E, then ρ(ε) is a decreasing function of ε, which implies the convex property of ρ(ε); and if α > E, then ρ(ε) is an increasing function of ε, which implies the concave property of ρ(ε). Thus, the ordinary density obeys a power-law relationship with scale which has the following properties (Cheng 1999b, 2007):

$$\lim_{\varepsilon \to 0} \rho(\varepsilon) = \begin{cases} 0, & \text{if } \alpha > E, \\ \infty, & \text{if } \alpha < E, \\ c, & \text{if } \alpha = E. \end{cases} \tag{10.9}$$

In accordance with these properties, ordinary density becomes volume dependent when α ≠ E and it tends to either zero or infinity when the scale ε becomes infinitesimal. The constant c in Eq. (10.8) can be expressed in the following form:

$$c = \lim_{\varepsilon \to 0} \rho(\varepsilon)\varepsilon^{E-\alpha} = \lim_{\varepsilon \to 0} \frac{\mu(\varepsilon)}{\varepsilon^{\alpha}}.\tag{10.10}$$

The constant c is indeed the convergent value of the ratio of measure (μ) over scale (ε) raised to the fractal dimension. This quantity is usually termed a scaling factor, but it can also be termed a fractal density or Hausdorff density, in analogy to the mass density, which corresponds to the ratio of measure over ordinary geometry with integer dimension (Cheng 2015). Therefore, while a unit of ordinary density is g/m<sup>E</sup>, the unit of fractal density becomes g/m<sup>α</sup>.
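
A window-based estimate of α and c at one location might be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited papers: the measure is a 2D grid of cell values (E = 2), ρ(ε) is taken as the mean cell value in a square window of side ε centred on the location, and Eq. (10.8) is fitted by least squares on log-log scales. The names `window_mean` and `local_singularity` and the synthetic grid are hypothetical.

```python
import math

def window_mean(grid, i, j, r):
    """Mean cell value in the (2r + 1) x (2r + 1) window centred on (i, j)."""
    cells = [grid[a][b]
             for a in range(i - r, i + r + 1)
             for b in range(j - r, j + r + 1)]
    return sum(cells) / len(cells)

def local_singularity(grid, i, j, radii=(0, 1, 2, 3), E=2):
    """Fit log rho = log c + (alpha - E) * log eps; return (alpha, c)."""
    xs = [math.log(2 * r + 1) for r in radii]                   # log window side
    ys = [math.log(window_mean(grid, i, j, r)) for r in radii]  # log mean density
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return E + slope, math.exp(y_mean - slope * x_mean)

# Synthetic grid: uniform background of 1.0 with one enriched cell in the centre.
size = 21
grid = [[1.0] * size for _ in range(size)]
grid[10][10] = 100.0

alpha_peak, c_peak = local_singularity(grid, 10, 10)  # at the enriched cell
alpha_bg, c_bg = local_singularity(grid, 3, 3)        # in the uniform background

print(f"peak:       alpha = {alpha_peak:.2f}")
print(f"background: alpha = {alpha_bg:.2f}, c = {c_bg:.2f}")
```

In this toy grid the uniform background returns α = E = 2 with c equal to the background value, whereas the enriched cell returns α < 2, the case in Eq. (10.9) where the density grows without bound as ε → 0. Windows must stay inside the grid, so locations closer to the edge than the largest radius are not handled here.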

### *10.5.4 Fractal Density Structure and Clustering Distribution*

The terminology of fractal density has been explained in several papers with different emphases, but the meanings of the concepts used are variable. For example, the term "fractal density" has been used to refer to the number of fractals per area (Hou and Wu 1989), which does not mean the same as the concept introduced in the current paper. Tatekawa and Maeda (2001) analyzed the time evolution of fractal density perturbations in the Einstein-de Sitter universe, with emphasis on how the perturbation evolves and what kind of nonlinear structure emerges. Similarly, Federrath et al. (2009) used the term fractal density structure when referring to the density structure in supersonic isothermal turbulence. Gromov et al. (2001) used fractal density to describe fractal galaxy distribution. Carpinteri et al. (2009) used the term to describe the mean fractal density of microcrack barycenters. Pope and Mackenzie (1988) introduced the concept of fractal density for describing the morphology of a fractal growth model in the evolution of gels from solution. They define the fractal density ρ which follows the relation

$$F = \frac{\rho}{\rho_0} = \left(\frac{r_0}{r}\right)^{3-D},\tag{10.11}$$

where D is the fractal dimension of fractal growth, F is the relative fractal density at radius r (r ≥ r0), and r0 and ρ0 are the core radius and core density, respectively. The core acts mathematically as a reference point for calculating the decrease in density as the fractal increases in size. A similar clustering fractal growth density function was used to describe tumor growth in fractal space-time with temporal density (Paramanathan and Uthayakumar 2011).
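
To make the behaviour of Eq. (10.11) concrete, the short sketch below evaluates the relative density F at a few radii for different fractal dimensions D. The function name `relative_fractal_density`, the core radius and the chosen radii are hypothetical values for illustration only.

```python
def relative_fractal_density(r, r0, D):
    """Relative density F = rho / rho0 = (r0 / r) ** (3 - D) for r >= r0."""
    if r < r0:
        raise ValueError("Eq. (10.11) is stated for r >= r0")
    return (r0 / r) ** (3 - D)

r0 = 1.0                        # core radius (hypothetical, arbitrary units)
radii = (1.0, 2.0, 4.0, 8.0)    # radii at which F is evaluated

for D in (3.0, 2.5, 2.0):
    profile = [relative_fractal_density(r, r0, D) for r in radii]
    print(f"D = {D}: " + ", ".join(f"{F:.4f}" for F in profile))
```

For D = 3 the relative density stays at 1 for every radius, while for D < 3 it decays as a power of r, more steeply the further D falls below 3.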

From the preceding publications we can see that in earlier studies by other authors the term of fractal density was introduced mainly for description of morphology and patterns of fractals and fractal growth modeling. The current research introduces the fractal density as a generalization of ordinary density of substance or energy to represent a fundamental new parameter or variable involved in dynamic systems.

### **10.6 Fractal Integral and Fractal Differential Operations of Nonlinear Functions**

As mentioned in connection with Eq. (10.2), for heterogeneous matter or substances the derivative of mass over scale can be used to define the localized density of the substance. Accordingly, the mass or volume of a heterogeneous substance can be calculated using integration. Integration and differentiation are two fundamental operations in calculus and are used in many mathematical and physical subjects. The traditional integral and differential operations are defined on the basis of the additive property of Lebesgue measure. When the measure no longer possesses the additive property, the classical integral and differential may not exist. Therefore, the ordinary integral and differential operations are not applicable to fractal density with singularity. The author has proposed the following fractal integral and differential (Cheng 2017a)

$$f_{\alpha}'(\mathbf{x}_0) = \frac{df(\mathbf{x})}{d\mathbf{x}^{\alpha}} = \lim_{\Delta \mathbf{x} \to 0} \frac{\Delta f(\mathbf{x})}{\left(\Delta \mathbf{x}\right)^{\alpha}} = \lim_{\mathbf{x} \to \mathbf{x}_0} \frac{f(\mathbf{x}) - f(\mathbf{x}_0)}{\left(\mathbf{x} - \mathbf{x}_0\right)^{\alpha}},\tag{10.12}$$

where Δ*f*(x) represents the increment of the function *f*(x) for an increment Δx of x. If the limit in Eq. (10.12) converges, its value is defined as the α-fractal derivative of the function *f*(x). Similarly, we can define the fractal integral of the function *f*(x) as follows

$$\int f(\mathbf{x})\,d\mathbf{x}^{\alpha} = \lim_{\Delta \mathbf{x} \to 0} \sum f(\mathbf{x}_{i})(\Delta \mathbf{x})^{\alpha},\tag{10.13}$$

where *f*(xi) is the magnitude of the function *f*(x) over the small range [xi, xi + Δx]. If the limit of Eq. (10.13) converges, then it can be named the α-fractal integral of the function *f*(x). It must be kept in mind that the fractal derivative defined in this paper is different from the fractional derivative (fractional order) known in the literature as *f*<sup>(v)</sup>(x), where v can be a non-integer order. The fractional derivative assumes that the normal integer order derivative *f*<sup>(n)</sup>(x) does exist. The fractal derivative is based on the fractal dimension of the measure, whereas the fractional derivative is based on a fractional order of derivative defined on a normal measure. As an example, let us take a power-law function to demonstrate the fractal derivative. Assume a power-law function *f*(x) = c(x − x0)<sup>b</sup>, whose ordinary derivative *f*′(x) = cb(x − x0)<sup>b−1</sup> does not exist at x = x0 if 0 < b < 1. The integral of the function is ∫*f*(x)dx = [c/(b + 1)](x − x0)<sup>b+1</sup>, which does not converge at x = x0 if b < −1. According to Eq. (10.12), the fractal derivative at x = x0 exists with *f*α′(x0) = c if α = b, whereas *f*α′(x0) = 0 if α < b and *f*α′(x0) = ∞ if α > b.
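
The three cases of the power-law example can be checked numerically. The sketch below evaluates the difference quotient of Eq. (10.12) for shrinking increments h and several values of α; it is only a finite-precision illustration of the limit, and the parameter values (c = 2, b = 0.5, x0 = 1) and the name `fractal_quotient` are hypothetical.

```python
def fractal_quotient(f, x0, alpha, h):
    """Difference quotient [f(x0 + h) - f(x0)] / h**alpha from Eq. (10.12)."""
    return (f(x0 + h) - f(x0)) / h ** alpha

c, b, x0 = 2.0, 0.5, 1.0   # hypothetical parameters with 0 < b < 1

def f(x):
    """Power-law test function f(x) = c * (x - x0)**b, defined for x >= x0."""
    return c * (x - x0) ** b

steps = [10.0 ** -k for k in range(1, 7)]   # h = 1e-1 ... 1e-6

# alpha = 1.0 corresponds to the ordinary first-order difference quotient.
quotients = {alpha: [fractal_quotient(f, x0, alpha, h) for h in steps]
             for alpha in (0.3, 0.5, 0.7, 1.0)}

for alpha, values in quotients.items():
    print(f"alpha = {alpha}: h=1e-1 -> {values[0]:.4g}, h=1e-6 -> {values[-1]:.4g}")
```

With α = b the quotient stays at c for every h, with α < b it shrinks toward zero, and with α > b (including the ordinary case α = 1) it grows without bound as h decreases, in line with the three cases stated above.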

A new concept of Hausdorff derivative underlying the Hausdorff dimension of metric space/time was proposed by Chen (2006) who introduced the systematic mathematical operation of Hausdorff derivative with applications to derive a linear anomalous transport–diffusion equation underlying an anomalous diffusion process. The Hausdorff derivative operation proposed by Chen (2006) is expressed as follows

$$\frac{\partial f(\mathbf{x})}{\partial \mathbf{x}^{\alpha}} = \lim_{\mathbf{x} \to \mathbf{x}_{0}} \frac{f(\mathbf{x}) - f(\mathbf{x}_{0})}{\mathbf{x}^{\alpha} - \mathbf{x}_{0}^{\alpha}} = \frac{\partial f(\hat{\mathbf{x}})}{\partial \hat{\mathbf{x}}}. \tag{10.14}$$

This formalism was termed the Hausdorff derivative of a function *f*(x) with respect to the fractal measure x<sup>α</sup>.

It has to be pointed out that the fractal derivative defined in Eq. (10.12) is different from that defined in Eq. (10.14) considering that, in general, if x0 ≠ 0, then

$$(\Delta \mathbf{x})^{\alpha} = (\mathbf{x} - \mathbf{x}_{0})^{\alpha} \neq \Delta \mathbf{x}^{\alpha} = \mathbf{x}^{\alpha} - \mathbf{x}_{0}^{\alpha}.\tag{10.15}$$

The two sides in Eq. (10.15) become equal only if x0 = 0. Otherwise, according to Taylor expansion, we can obtain Δx<sup>α</sup> = x<sup>α</sup> − x0<sup>α</sup> = αx0<sup>α−1</sup>Δx + o(Δx), so substitution into Eq. (10.14) gives

$$\frac{\partial f(\hat{\mathbf{x}})}{\partial \hat{\mathbf{x}}} = \lim_{\mathbf{x} \to \mathbf{x}_0} \frac{f(\mathbf{x}) - f(\mathbf{x}_0)}{\mathbf{x}^{\alpha} - \mathbf{x}_0^{\alpha}} = \frac{1}{\alpha \mathbf{x}_0^{\alpha-1}} \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}},\tag{10.16}$$

which implies that the derivative of *f*(x) defined in Eq. (10.14) indeed corresponds to the ordinary derivative except for the factor 1/(αx0<sup>α−1</sup>). Reconsidering the example used previously with *f*(x) = c(x − x0)<sup>b</sup>, the derivative of Eq. (10.14) at x = x0 does not exist if b < 1.
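
The relation in Eq. (10.16) can also be checked numerically for a smooth function. The sketch below compares the difference quotient of Eq. (10.14) at a small increment with the ordinary derivative multiplied by the factor 1/(αx0<sup>α−1</sup>). The test function *f*(x) = x<sup>2</sup>, the values x0 = 2 and α = 0.5, and the name `hausdorff_quotient` are hypothetical choices for illustration.

```python
def hausdorff_quotient(f, x0, alpha, h):
    """Difference quotient [f(x0 + h) - f(x0)] / [(x0 + h)**alpha - x0**alpha]."""
    return (f(x0 + h) - f(x0)) / ((x0 + h) ** alpha - x0 ** alpha)

def f(x):
    """Smooth test function (hypothetical choice)."""
    return x ** 2

def f_prime(x):
    """Ordinary derivative of the test function."""
    return 2.0 * x

x0, alpha, h = 2.0, 0.5, 1e-6

numeric = hausdorff_quotient(f, x0, alpha, h)
predicted = f_prime(x0) / (alpha * x0 ** (alpha - 1))   # right-hand side of Eq. (10.16)

print(f"difference quotient: {numeric:.6f}")
print(f"rescaled derivative: {predicted:.6f}")
```

The two numbers agree to within the truncation error of the finite increment, which illustrates that for x0 ≠ 0 the quotient of Eq. (10.14) carries the same information as the ordinary derivative, up to a known factor.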

### **10.7 Earth Dynamic Processes and Extreme Events**

In the remainder of this chapter I demonstrate that fractal density (Δα ≠ 0) characterizes anomalous mass accumulation or energy release caused by extreme geo-processes in the Earth's lithosphere. These processes originate from cascade earth dynamics (plumes, mantle convection and plate tectonics) and from self-organized criticality involved in phase transitions (avalanches of slab breakoffs, faults, and volcanic eruptions).

Mantle convection at high Rayleigh number episodically generates thermal plumes which, upon arrival in the crust, could cause major flood basalt events and igneous provinces as well as spreading of continents and mid-ocean ridges (Richards et al. 1989; White and McKenzie 1989). On a larger scale, Wilson cycles (Wilson 1966), corresponding to the periodic fragmentation and reformation of supercontinents, could be linked to temporal variability in plate tectonics. Numerous studies have revealed that mantle convection can induce exchange of mass between the upper and lower mantle across the endothermic phase transition zone at about 660 km depth. The cold downwelling material penetrates into the lower layer and, simultaneously, the hot upwelling fluid is pushed into the upper layer. The exchange of mass between the upper and lower mantle layers can occur in short bursts (often described with superlatives such as "catastrophic", overturn, "avalanche" subduction, or "superplumes") (Zhong and Gurnis 1994). The quick injection of hot lower-mantle fluid into the upper mantle can cause not only mantle heterogeneity but also anomalous thermal distribution near the surface (Le Bars and Davaille 2004). This has been considered to be the first-order cause of vigorous magmatism. Deep subduction of continental crust into the Earth's interior, followed by its return to the surface, has been ascertained by the discoveries of regional metamorphic coesite (Chopin 1984; Smith 1984) and subsequently of unusual ultrahigh pressure (UHP) terranes (Hacker and Gerya 2013).

Within the lithosphere there are various types of "catastrophic" events occurring during plate subduction. Formation of a magmatic arc can be caused by subduction in which the subducting or subducted oceanic crust material releases volatiles (e.g. H2O and CO2) which cause partial melting of the mantle and form magma at depth under the overriding plate. Earthquakes occur at certain depths at the edges of three types of plate boundaries: convergent (subduction and collision), divergent, and transform.

### *10.7.1 Phase Transition*

From mathematical and physical points of view, several mechanisms have been shown to generate power-law distributions, including but not limited to phase transition (PT), self-organized criticality (SOC) and multiplicative cascade processes (MCP) (Newman 2005; Lovejoy et al. 2009). I will elaborate on each of these mechanisms in relation to tectonic events induced by mantle convection, plumes and lithosphere rheology. The phase of a thermodynamic system and the state of matter in a normal system have uniform physical properties. Common phases include the liquid, solid and vapor phases of chemical components, which exist under certain pressure and temperature (P-T) conditions. Materials in different phases have distinct properties; for example, a liquid usually has a higher density and smaller specific volume than a gas. However, in phase transition conditions, multiple phases coexist within the same system, such as liquid and vapor in magma and hydrothermal systems under proper P-T conditions. At a critical condition (the critical point on the phase diagram) liquid and vapor become indistinguishable, and beyond this point the liquid and gas become a so-called supercritical fluid, representing a special phase of matter which can effuse through solids like a gas and dissolve materials like a liquid (McMillan and Stanley 2010). The critical point for water occurs at a temperature of 374 °C and a pressure of 22 MPa. It has been found that the critical point is so peculiar that, close to it, small changes in pressure or temperature result in large changes in density and other density-related properties such as viscosity, relative permittivity, heat capacity and solubility. The special critical point phenomena can be expressed by the following empirical power-law functions (Sengers and Levelt Sengers 1968, 1986):

$$
\Delta \rho = c \left(\Delta P\right)^{1/3}, \quad \Delta \rho = c \left(\Delta T\right)^{1/2}, \tag{10.17}
$$

where Δρ, ΔP, and ΔT represent the changes of density, pressure and temperature, respectively along the coexistence curve. These power-law relations hold for small changes of temperature or pressure from the condition at the critical point of the system. Although the two functions of Eq. (10.17) show continuity at zero increment with Δρ = 0, ΔP = 0, and ΔT = 0, the first order derivatives of density versus either temperature or pressure (change rate of density difference) do not exist or show singularity at ΔP = 0 and ΔT = 0 as shown in the following forms

$$\frac{\Delta\rho}{\Delta T} = c\left(\Delta T\right)^{-1/2}, \quad \frac{\Delta\rho}{\Delta P} = c\left(\Delta P\right)^{-2/3}. \tag{10.18}$$

These properties describe the phenomena of property change such as fractal density (density jump) at the phase transition zone. In addition, the ratio of the increments of pressure and temperature follows a power-law relation, ΔP/ΔT = *c*(ΔP)<sup>−1/3</sup>. Such a power-law relation implies that the Clapeyron slope could become infinite, or singular, when approaching the coexistence curve. Clapeyron slope and density jump are critical parameters in numerical simulation of mantle convection; for example, Korenaga (2004) developed a numerical model to simulate mantle mixing and continental breakup magmatism by assigning a Clapeyron slope of −2 MPa/K and a density jump of 10% for the endothermic phase transition at 660 km depth. The episodicity of convection induced by the endothermic phase changes strongly depends on plate length, rheology, and Clapeyron slope (Zhong and Gurnis 1994). Ogawa and Yanagisawa (2014) developed models with small Clapeyron slopes of −0.2 to −1 MPa/K that simulate regimes ranging from punctuated layered convection to whole-mantle convection, in modeling the evolution of the mantle on Venus driven by magmatism and phase transitions. Their models indicate that the earlier-stage layered mantle convection is punctuated by repeated bursts of hot material from the deep mantle to the surface. Other phase transition phenomena may occur at the boundary of deeply subducted slabs. As oceanic lithosphere is subducted underneath continental lithosphere, solid-phase lithosphere can be partially melted, which facilitates the formation of magma. As subduction progresses, H2O and other volatile components contained in the rocks are progressively released from the slab at different depths. Fluids or melts released at greater depths will be in a supercritical fluid phase which hydrates the mantle and causes partial mantle melting. 
This eventually leads to deeply rooted magma which provides the source for the magmatic and volcanic arcs located above subduction zones. Partial melting in the lower crust and mantle also changes the strain rate of the lithosphere, which facilitates the formation of intermediate and deep earthquakes (Dimanov et al. 2000). The processes of fluid release and migration are complex and, to a large extent, their details remain unknown. Because of the great depth of subduction, the released fluid may be in a supercritical condition with, as mentioned earlier, fractal density and strong solvent strength that facilitate the hydration and metasomatism of mantle rocks. When the pressure and temperature are reduced to around the critical point, the system undergoes a sharp reduction in density, with a corresponding increase in specific volume, which further enlarges pore space and fractures rocks, in turn facilitating the formation of magma and earthquakes through positive feedback.
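The behavior described by Eqs. (10.17) and (10.18) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names, the constant c = 1 and the arbitrary units are assumptions made purely for illustration. It shows that the density change vanishes as the increment shrinks while the change rate grows without bound.

```python
def density_jump(delta, exponent, c=1.0):
    """Density change c * delta**exponent, the form of Eq. (10.17)."""
    return c * delta ** exponent


def change_rate(delta, exponent, c=1.0):
    """Density change per unit increment, c * delta**(exponent - 1), as in Eq. (10.18)."""
    return density_jump(delta, exponent, c) / delta


# Shrinking temperature increments (arbitrary units, c = 1 assumed).
for delta_t in (1.0, 1e-2, 1e-4, 1e-6):
    jump = density_jump(delta_t, 0.5)   # tends to zero: the function is continuous
    rate = change_rate(delta_t, 0.5)    # keeps growing: the derivative is singular
    print(f"dT={delta_t:g}  d_rho={jump:g}  d_rho/dT={rate:g}")
```

Calling `change_rate(delta_p, 1/3)` gives the corresponding pressure form with exponent −2/3.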

### *10.7.2 Self-organized Criticality*

The phenomena associated with continuous phase transitions are called critical phenomena, and these are often related to so-called self-organized criticality (SOC). SOC is commonly illustrated conceptually with avalanches resulting from piles of sand which generate a power-law number-size distribution of avalanche magnitudes (Bak et al. 1987). At the criticality point in a SOC phenomenon a small continuous input to the system can cause sudden and discontinuous outputs or avalanches. For example, a fault occurs in broken brittle rock strata when an extra stress is added to change the system at the criticality point. The size and number of faults generated may follow a power law distribution with a small number of large faults and a large number of small faults. SOC is similar to critical point phase transition since both processes involve anomalous state change caused by a minor continuous input pulse at the critical condition point. Numerous studies have also pointed out the effect of the 660-km endothermic phase transition on convection. This could actually generate the periodic occurrence of abrupt changes in convective mode (660-km layered/whole mantle), consecutive with the sudden flushing of oceanic plates previously accumulated above the transition zone (e.g., Le Bars and Davaille 2004). Many numerical simulations have demonstrated multiple scale and sizeable whole mantle convection, and sublithospheric convection can bring up dense fertile mantle materials from the lower mantle to the upper mantle (Korenaga 2004). Cold downwellings are temporarily stopped by the 660 km endothermic phase change but sink rapidly into the lower mantle (Tackley et al. 1993). The intermittence of layering reflects accumulation and release of negative buoyancy above the endothermic phase boundary (Machetel and Weber 1991; Tackley et al. 1993). The exchange of mass between upper and lower layers can occur in short bursts (Zhong and Gurnis 1994). 
Although these types of avalanching behavior are not as easy to test as those of sand piles, one might reasonably assume that processes of this SOC nature can generate end products with power-law distributions. In fact, SOC phenomena have commonly been invoked to describe extreme geo-events in plate tectonics. Examples include, but are not limited to, earthquakes (Gutenberg and Richter 1944; Turcotte 1997), volcanic eruption durations (Cannavò and Nunnari 2016), plate sizes (Sornette and Pisarenko 2003), slab breakoff (Condie 1998), areal size of magmatism (Pelletier 1999), mineral deposits (Agterberg 1995; Cheng 1999b; Maier and Groves 2011), heat flow over mid-ocean ridges (Cheng 2016), episodic evolution of supercontinents and crustal growth (Cheng 2017b), and the energy–probability relation of earthquakes (Cheng and Sun 2017). Other examples can be found in the book by Sornette (2004). The processes involved in these extreme events create end products which can be described by frequency–size or frequency–time power-law relations. Based on the above reasoning, we may expect that lithospheric root detachments and slab breakoffs that occurred during subduction are of different sizes which follow power-law distributions. Some of the small events may not be noticeable at the surface because of their small impact on the global system, but large detachments and slab breakoffs can have a significant impact on syn- to post-collisional magmatism and metamorphism. The size–frequency distribution of these types of events can be modelled by the following general power-law relation

$$\mathbf{N}(>\mathbf{A}) = c\mathbf{A}^{-\mathbf{b}},\tag{10.19}$$

where A represents the size of an event and N(>A) the cumulative number of events with size greater than the threshold A. This power-law function involves two constants, c and b. For example, the well-known Gutenberg-Richter power-law distribution relates the number of large earthquakes to their sizes (Gutenberg and Richter 1944; Turcotte 1997). The exponent, or b-value, has been commonly used for predictive purposes. The b-value was found to be internally related to singularity in terms of fractal probability density (Cheng and Sun 2017) with

$$E(<P) \propto P^{-\beta},\tag{10.20}$$

where E(<P) represents the minimum energy released by large earthquakes with occurrence probability less than P. This equation indicates that the minimum energy released by large earthquakes follows a power-law relation (β = (2/3)b) with the probability of occurrence of earthquakes with energy greater than E. This model implies that the smaller the probability (P) of a large earthquake, the larger its energy release (E).
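As an illustration of how the two constants of Eq. (10.19) can be estimated, the sketch below fits a straight line to log N(>A) against log A by ordinary least squares. It is a minimal example under stated assumptions: the thresholds and counts are synthetic values generated from c = 500 and b = 0.8, and the function name `fit_power_law` is hypothetical rather than taken from any cited work.

```python
import math


def fit_power_law(sizes, counts):
    """Least-squares fit of log N = log c - b * log A; returns (c, b)."""
    xs = [math.log(a) for a in sizes]
    ys = [math.log(n) for n in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic cumulative counts that follow N(>A) = 500 * A**-0.8 exactly.
thresholds = [1.0, 2.0, 4.0, 8.0, 16.0]
cumulative = [500.0 * a ** -0.8 for a in thresholds]

c_hat, b_hat = fit_power_law(thresholds, cumulative)
print(f"c = {c_hat:.1f}, b = {b_hat:.2f}")
```

With real catalogues the counts scatter around the line, so the quality of the fit should be checked before the exponent is interpreted.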

### *10.7.3 Multiplicative Cascade Processes*

Multiplicative cascade processes (MCP) are iterative multiplicative processes across multiple scales, which involve positive or negative feedback to generate extreme values that follow multifractal power-law distributions (power-law distributions with multiple exponents) with self-similarities and singularities (Meakin 1987; Schertzer and Lovejoy 1987; Agterberg 2007; Cheng 2014). Examples of MCP are common in the study of geocomplexity, such as the formation of clouds, severe weather and storms (Schertzer and Lovejoy 1987; Malamud et al. 1996; Turcotte 1997; Veneziano and Furcolo 2002), to name just a few. Mantle convection can be viewed as a multiplicative cascade process that creates heterogeneity in the mantle by recycling material from the upper crust to the mantle. On a large scale, the Wilson cycle involves the opening and closing of an individual oceanic basin, plate drift, plate subduction and plate collision; it recycles lithosphere material and causes extreme events at the interfaces of phase transition zones or in zones around plate boundaries. Depending on the properties of subduction and other factors, plate subduction may cause slab deformation, erosion and breakoff, deep subduction, and collision of continents. These events are responsible for extreme events such as magmatism and earthquakes. During such processes, changes of pressure and temperature as well as of water content often provide positive feedback on the melting or partial melting of the lithosphere and on the generation of magma reservoirs and seismicity. In the context of multiplicative cascade processes, the resulting mass and energy distributions often prove to have self-similarity and singularity, which can be modelled by multifractal distributions (Meakin 1987; Schertzer and Lovejoy 1987; Cheng and Agterberg 2009).
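A simple way to see how iterated multiplication produces singular values is a deterministic binomial cascade on a line segment. The sketch below is an illustrative stand-in, not a model of mantle convection: the splitting fraction 0.7, the number of levels and the function name are assumptions. At each level every cell passes a fixed fraction of its mass to one half and the remainder to the other half.

```python
import math


def binomial_cascade(levels, fraction=0.7, total=1.0):
    """Split each cell's mass into fraction and (1 - fraction) parts at every level."""
    cells = [total]
    for _ in range(levels):
        next_cells = []
        for mass in cells:
            next_cells.append(mass * fraction)
            next_cells.append(mass * (1.0 - fraction))
        cells = next_cells
    return cells


levels = 10
cells = binomial_cascade(levels)
cell_size = 2.0 ** -levels

# Mass in the richest cell scales as cell_size**alpha; alpha < 1 means that
# the density (mass / cell_size) grows without bound as the cells shrink.
alpha_richest = math.log(max(cells)) / math.log(cell_size)
print(f"{len(cells)} cells, total mass {sum(cells):.6f}, alpha = {alpha_richest:.3f}")
```

Total mass is conserved at every level, yet the mass becomes increasingly concentrated in a few cells, which is the qualitative behavior associated with multifractal distributions.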

The aforementioned mechanisms (PT, SOC and MCP) can coexist in the evolution of earth dynamics systems which cause cascade effects for anomalous diffusion and strain rate originating earthquakes or magmatism creating flare up formation of magmatic activity or cluster frequency-depth distribution of earthquakes. Based on possible mechanisms (PT, SOC and MCP) corresponding to power-law distributions, the fractal density (power-law density) and the singularity analysis method can be used to characterize the causational relations between extreme events such as magmatic activities and earthquakes and the aforementioned nonlinear mechanisms. In the following section a case study of earthquakes will be used to demonstrate the effect of phase transition on formation and distribution of earthquakes that occur along Pacific plate subduction zones.

### **10.8 Fractal Density of Lithosphere Rheology in Phase Transition Zones and Association with Earthquakes**

### *10.8.1 Rheology Constitutive Equation*

In the study of earth tectonics, rheology is an important concept describing rock properties with respect to flow behavior which can be characterized through the following empirical constitutive equation associating stress and strain rate (e.g., Dimanov et al. 1998).

$$\dot{\varepsilon} = A \sigma^{n} d^{-m} f_{H_2O}^{r} e^{-\frac{Q + PV}{RT}},\tag{10.21}$$

where ε̇ represents the strain rate, σ the stress, n the stress exponent, d the grain size, m the grain-size exponent, *f*<sub>H2O</sub> the water fugacity, r the fugacity exponent, Q the activation energy, P the pressure, V the activation volume, T the absolute temperature, R the molar gas constant, and A a material constant. The constitutive Eq. (10.21) is widely used in the literature to describe the rheology of ductile crust, and it is so well known that it is often given without citation. Several authors have investigated this equation by various methods, including physical experiments (Pharr and Ashby 1983; Dimanov et al. 1998). The parameters involved in the equation can be estimated using a log-linear model, except for the last combined term:

$$\log(\dot{\varepsilon}) = \log A + n \log(\sigma) - m \log(d) + r \log(f_{H_2O}) - \frac{Q + PV}{RT}. \tag{10.22}$$
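The log-linear structure of Eq. (10.22) can be illustrated with a short calculation. This is a minimal sketch under assumed conditions: all parameter values below are hypothetical and were chosen only to exercise the formula, and `strain_rate` is an illustrative helper name. With every other factor held fixed, the slope of log strain rate against log stress recovers the stress exponent n.

```python
import math

R_GAS = 8.314  # molar gas constant in J/(mol K)


def strain_rate(stress, grain_size, fugacity, temperature, pressure,
                A, n, m, r, Q, V):
    """Evaluate Eq. (10.21) for one set of conditions."""
    arrhenius = math.exp(-(Q + pressure * V) / (R_GAS * temperature))
    return A * stress ** n * grain_size ** (-m) * fugacity ** r * arrhenius


# Hypothetical parameter values, not taken from any cited experiment.
params = dict(A=1.0, n=3.0, m=0.0, r=1.0, Q=3.0e5, V=1.0e-5)
conditions = dict(grain_size=1.0e-3, fugacity=100.0,
                  temperature=1200.0, pressure=1.0e9)

rate_low = strain_rate(stress=50.0, **conditions, **params)
rate_high = strain_rate(stress=100.0, **conditions, **params)

# With everything else fixed, Eq. (10.22) is linear in log(stress) with slope n.
n_estimate = math.log(rate_high / rate_low) / math.log(100.0 / 50.0)

# Effective viscosity (stress / strain rate) falls as stress rises when n > 1.
viscosity_low = 50.0 / rate_low
viscosity_high = 100.0 / rate_high
print(f"n = {n_estimate:.2f}, viscosity ratio = {viscosity_high / viscosity_low:.3f}")
```

The same two-point slope applied to grain size or water fugacity would recover −m and r, respectively.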

Effects of some of the parameters have been summarized by several authors (e.g., Bürgmann and Dresen 2008). For example, diffusion-controlled deformation is linear in stress with n = 1. Different inverse dependencies on grain size have been predicted for lattice diffusion–controlled and grain boundary diffusion–controlled creep, with m = 2 and m = 3, respectively. Creep of fine-grained materials involves grain boundary sliding, which may be controlled by grain boundary diffusion (n = 1) or by dislocation motion (n = 2). For climb-controlled dislocation creep, deformation is commonly assumed to be grain-size insensitive (m = 0) with a stress exponent of n = 3–6 (e.g., Bürgmann and Dresen 2008). Materials for which strain rate is proportional to stress raised to a power n > 1 are referred to as having a power-law rheology, whose effective viscosity (μ = σ/ε̇ ∝ σ<sup>1−n</sup>) decreases when stress increases. The significant effects of melt distribution on the rheology of rocks have been reported by many authors (e.g., Dimanov et al. 1998, 2000). In general, the strain rate increases with water fugacity, in proportion to its r-th power. The general bivariate relations between the strain rate and the other factors in the equation are valid and can be applied to characterize the general associations of the factors considered in the system (Wang 2016; Dimanov et al. 2000). However, the equation is valid for normal media that generally do not possess singularity for non-zero values of the factors. It cannot be used to describe the singular behavior of the constitutive equation at a phase transition, nor can it be used directly to delineate zones of phase transition. The variable depth-frequency distribution of crustal earthquakes and lithological compositions are often integrated to characterize crust deformation in relation to variations of tectonic style (Mouthereau and Petit 2003). 
In the following section I attempt to derive a suitable equation to characterize the rheology in phase transition zones.

### *10.8.2 Rheology and Phase Transition*

In order to explain the phase transition zones in the lithosphere associating the effect of phase transition with origin of seismicity and magmatism, one needs to link the rheology to depth of lithosphere. It has been generally accepted that in the brittle crust, frictional strength increases linearly with depth. Phase transitions separate regions into groups of rocks dominated by quartz, feldspar and olivine, respectively; and these regions are characterized by brittle or plastic properties of lithosphere (e.g., Jackson 2002; Bürgmann and Dresen 2008). It was suggested by Sibson (1974) that brittle strength in the crust can be approximated by the Sibson's formulation in which the coefficients of friction and cohesion for pre-fractured rocks are equal to internal friction and cohesion for intact samples:

$$
\sigma = \sigma_1 - \sigma_3 = \beta \rho g z (1 - \lambda),
\tag{10.23}
$$

where σ = σ<sub>1</sub> − σ<sub>3</sub> represents differential stress, z is depth, ρ is the average density of the overburden, g is the acceleration of gravity, β is a coefficient which depends on the type of faulting, and λ represents the pore fluid ratio. Under hydrostatic pressure, λ is 0.36, and it is 0 and 0.7 for dry and wet conditions, respectively (Mouthereau and Petit 2003). In order to discuss the behavior of rheology around a phase transition, let us define the depth at the center of the phase transition zone as z0, which will serve as the reference coordinate for further comparison. Let us also denote a small depth increment around the phase transition zone as Δz = abs(z − z0), and the corresponding increment of differential stress around the phase transition zone as Δσ = abs{(σ<sub>1</sub> − σ<sub>3</sub>)(z) − (σ<sub>1</sub> − σ<sub>3</sub>)(z0)}. When Δz is very small around the phase transition center z0, we can derive the following approximation, assuming that changes of depth z, β and λ are negligible:

$$
\Delta\sigma \propto \Delta\rho,\tag{10.24}
$$
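The step from Eq. (10.23) to Eq. (10.24) can be illustrated numerically. The sketch below rests on assumed values: a depth of 10 km, an overburden density of 2800 kg/m³, the hydrostatic λ = 0.36 quoted above, and β = 3.0 as a hypothetical faulting coefficient. With depth, β and λ held fixed, a change in density maps linearly onto a change in differential stress.

```python
G = 9.81  # acceleration of gravity, m/s^2


def brittle_strength(depth, density, beta, pore_ratio):
    """Differential stress (Pa) from Eq. (10.23): beta * rho * g * z * (1 - lambda)."""
    return beta * density * G * depth * (1.0 - pore_ratio)


# Assumed example values (see the lead-in); beta = 3.0 is hypothetical.
depth, beta, lam = 10_000.0, 3.0, 0.36
sigma = brittle_strength(depth, 2800.0, beta, lam)

# Holding depth, beta and lambda fixed, a density change maps linearly to stress.
delta_rho = 50.0
delta_sigma = brittle_strength(depth, 2800.0 + delta_rho, beta, lam) - sigma
print(f"sigma = {sigma / 1e6:.1f} MPa, delta_sigma = {delta_sigma / 1e6:.2f} MPa")
```

The ratio `delta_sigma / delta_rho` equals β g z (1 − λ), the proportionality constant implied by Eq. (10.24).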

According to the phase transition property of density and temperature or pressure similar to Eq. (10.17) we can assume the mass density of lithosphere around the phase transition center to be

$$
\Delta\rho \propto \left(\Delta T\right)^b. \tag{10.25}
$$

Further assuming that the temperature and depth increments are linearly associated when the depth increment is very small, we obtain

$$
\Delta\rho \propto \left(\Delta T\right)^b \propto \left(\Delta z\right)^b,\tag{10.26}
$$

Therefore, the derivative of Eq. (10.26) satisfies

$$\frac{\Delta\rho}{\Delta z} \propto (\Delta z)^{b-1},\tag{10.27}$$

This result implies that the change rate of density with depth (Δρ/Δz) follows a power-law relation with the increment of depth (Δz). If the exponent b is less than 1, the change rate approaches infinity when Δz → 0, which implies that the change rate of differential stress, according to Eqs. (10.27) and (10.24), can become infinitely large. Assuming the other factors to be negligibly small in Eq. (10.21) when Δz is very small, we obtain

$$\frac{\Delta \dot{\varepsilon}}{\Delta z} \propto (\Delta z)^{b-1},\tag{10.28}$$

If the exponent b is less than 1, then the change of strain rate per increment of depth approaches infinity when Δz → 0. It should be noted that the derivation of Eqs. (10.24)–(10.28) is based on several assumptions involving first-order approximations of factors, which may need further theoretical justification (a detailed discussion will be published elsewhere). Nevertheless, the results obtained here might be the first power-law model providing a possible quantitative description of the singularities of differential stress at the phase transition, as indicated in the schematic diagram (Fig. 10.1).

### *10.8.3 Frequency—Depth Fractal Density Distribution and Singularity Analysis of Earthquakes*

In order to demonstrate the effect of differential stress caused by phase transition on the formation and distribution of earthquakes, several datasets of earthquakes with magnitudes of three or above were selected for several small regions along the Ring of Fire, the Pacific plate boundaries. Data were downloaded from the USGS website under the section of the USGS Earthquake Hazards Program (https://earthquake.usgs.gov/earthquakes/map/). The locations of the 30 small areas selected from the Aleutian Islands, Kuril Islands, Mariana, Tonga Trench, Mexico, northern Chile and southern Chile are shown in Fig. 10.2. Several hundred to several thousand earthquakes are selected in each area. These areas were chosen within a short range from the plate boundaries to ensure that they contain enough large earthquakes which occurred along subduction zones with similar properties.

**Fig. 10.1** Strength envelopes of differential stress versus depth for a general lithospheric condition to illustrate the potential effects of phase transition. The equations are about increment rate of differential stress around the depth of phase transition zone. Notations and discussions about the equations are given in the text

**Fig. 10.2** Study areas located along the Pacific plate boundaries. Data containing earthquakes with magnitudes M ≥ 3, and their depths were downloaded from the USGS website. The yellow dots represent the locations of the study areas and the size of each dot represents the level of singularity calculated using the model introduced in the current paper

The main purpose of the case study here is to validate whether earthquakes that occurred in the subduction zones possess clustering with fractal density; therefore, we choose earthquakes at depths around the Moho, ranging from 30 to 100 km. Because shallow earthquakes are assigned a "normal" depth of 33 km, or default depths of 5 or 10 km, when depths are poorly constrained by the available seismic data, we only analyze earthquakes with depths ranging from 34 to 100 km. The numbers of earthquakes in each dataset were grouped into 10-km depth frequency bins. A pronounced peak of the frequency distribution can be observed around 33 km in all datasets except for western California. To reduce the effect of the "default peak" at a depth of 33 km, further analysis of the frequency data is based on earthquakes with depths from 34 km downward. As an example, the frequency–depth distribution of 1263 earthquakes with magnitude greater than or equal to 3 and depths between 34 and 100 km from the Tonga region is shown in Fig. 10.3a, with the data grouped in bins of 10 km (frequency–depth distributions for other datasets are not shown here). This graph shows a pronounced frequency peak at 34–44 km. By visual inspection one can see the frequency within 60 km of the peak (from 34 to 94 km) decaying rapidly from the location of the peak at 34 km downward. To validate the fractal density of the frequency clustering distribution, the following local number-depth density of earthquakes around the peak z0 was constructed

$$\rho(\Delta z) = \frac{\text{total number of earthquakes in depth range } z_0 + \Delta z}{\Delta z} = c \Delta z^{-b}, \tag{10.29}$$

where Δz is the window size measured from z0, and c and b are two parameters to be estimated using the local singularity analysis method (LSA) with windows of multiple sizes: Δz = 10, 20, …, 60 km. The results are calculated for all 30 datasets. Several selected examples are shown in Fig. 10.3b–h. There is no significant peak at 33 km in the datasets from the areas of western California. The decay curves in Fig. 10.3 are least squares fits of power-law functions to the data. The results estimated from the seven datasets give b = 0.90 (E13), 0.44 (E7), 0.27 (E2), 0.49 (N2), 0.55 (N5), 0.69 (W4) and 0.74 (W11), respectively. Coefficients of determination for the least squares fits to all seven datasets are high, with R<sup>2</sup> > 0.98 (Student's t-value > 14), indicating statistically significant power-law models fitted to the data.
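The windowed estimation behind Eq. (10.29) can be sketched as follows. This is a minimal illustration, not the procedure or data used in the study: the depth catalogue is synthetic (6000 depths generated so that the true exponent is b = 0.5), the depth range is taken as [z0, z0 + Δz], and the helper names are hypothetical. The density is computed in windows of 10 to 60 km below z0 = 34 km and a power law is fitted by least squares in log-log space.

```python
import math


def local_density(depths, z0, window):
    """Number of events with depth in [z0, z0 + window], divided by the window size."""
    count = sum(1 for z in depths if z0 <= z <= z0 + window)
    return count / window


def fit_power_law(windows, densities):
    """Least-squares fit of log(density) = log(c) - b * log(window); returns (c, b)."""
    xs = [math.log(w) for w in windows]
    ys = [math.log(d) for d in densities]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return math.exp(mean_y - slope * mean_x), -slope


# Synthetic catalogue (not real data): cumulative count grows like window**0.5,
# so the density count / window decays with exponent b = 0.5.
z0, n_events, max_window, b_true = 34.0, 6000, 60.0, 0.5
depths = [z0 + max_window * (i / n_events) ** (1.0 / (1.0 - b_true))
          for i in range(1, n_events + 1)]

windows = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
densities = [local_density(depths, z0, w) for w in windows]
c_hat, b_hat = fit_power_law(windows, densities)
print(f"estimated b = {b_hat:.3f}, c = {c_hat:.1f}")
```

With a real catalogue the depths would come from the downloaded earthquake records, and the goodness of fit would need to be assessed before interpreting b as a singularity index.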

**Fig. 10.3** Distribution of frequency density of earthquakes with magnitudes equal to or greater than 3 from around Moho at 34 km downward. **a** Frequency–depth distribution of earthquakes from Tonga region; **b**–**h** Distribution of decay of frequency density of earthquakes (#/km) with depths from around peak at 34 km downward; Power-law functions were fitted to the observed data by least squares

The results obtained by local singularity analysis of all 30 datasets (except E1, E3, E8) demonstrate that the frequency–depth distributions of large earthquakes (M ≥ 3) are not uniform but show clustering which can be modelled by the local fractal density model of Eq. (10.29). The datasets E1, E3 and E8 show linear decay instead of power-law decay. Moreover, the results (shown as yellow dots in Fig. 10.2) demonstrate that the frequency–depth density distributions of earthquakes from the southwestern boundaries of the Pacific plates depict stronger singularities than those of earthquakes from the southeastern boundaries of the Pacific plates, except for the earthquakes in conjugation regions of three plate boundaries (e.g., N4, N5, W4, W5, W9-W11, E4, E13, E14), which depict stronger singularity. This finding might be significant for understanding the different mechanisms causing earthquakes at the eastern and western Pacific plate boundaries. As reported in the literature, the western boundaries of the Pacific plates are generally colder and older than the eastern boundaries (Kong et al. 2016; Okazaki and Hirth 2016). Low slab temperatures resulting from faster subduction cause deeper earthquakes (Wei et al. 2017). Omori et al. (2004) studied the association of the distribution of dehydration events with earthquakes and found a non-linear correlation between the maximum depth of earthquakes and the temperature of the slab, with a lack of deep earthquakes in young subduction zones. Their work showed that deeper earthquakes (> 300 km) are mostly located in the selected areas along the western subduction zones of the Pacific plates, whereas fewer deep earthquakes occurred at the eastern boundaries of the Pacific plates. The results of the current research may provide supplementary information about the singularity of the frequency-depth distribution of shallow earthquakes around the Moho in the subduction zones of the Pacific plates. Local singularity analysis may provide a new tool for characterizing and distinguishing between earthquakes from a fractal and self-similarity point of view. Further work will extend the analysis to cover more areas and other depths of earthquakes. Other sizes of earthquakes will also be considered.

### **10.9 Discussion and Conclusions**

In the first part of the chapter, suggestions about mathematical geosciences (MG), or geomathematics, as a discipline and examples of significant contributions of mathematical geoscientists to science were included in order to appeal to the public and to geoscientists to appreciate the indispensable role that MG can play in the family of geosciences. In the second part of the chapter, the fractal density model was introduced and used for characterizing the power-law rheology of phase transition, and singularity analysis of earthquakes from subduction zones of the Pacific plates was demonstrated to be a new and promising nonlinear MG method for modeling extreme and "avalanche" geo-events. Examples of application of singularity analysis include not only earthquakes, as introduced in the current chapter, but also other types of extreme events such as magmatic flare-ups (Cheng 2017a), anomalous heat flow at mid-ocean ridges (Cheng 2016), flooding caused by tropical storms (Cheng 2008), and mineral deposits as well as ore-caused anomalies in surface media (Cheng 2007). Further comprehensive analysis of earthquakes from other regions and clustering depths will be published in separate papers.

**Acknowledgements** Thanks are due to Professor Frits Agterberg for critical review of the paper and constructive comments. I thank Professor B. S. Daya Sagar and Petra van Steenbergen for allowing the extra time to complete the manuscript. Mr. Shubing Zhou is thanked for assistance in preparing the datasets. The research has been jointly supported by the National Key Technology R&D Program of China (No. 2016YFC0600501) and the State Key Program of the National Natural Science Foundation of China (41430320).

### **References**


Mandelbrot BB (1983) The fractal geometry of nature. WH Freeman and Co., New York, p 495

Mandelbrot BB (1989) Multifractal measures, especially for the geophysicist. Pure Appl Geophys 131:5–42


Wilson JT (1966) Did the Atlantic close and then re-open? Nature 211:676–681



### **Part II General Applications**

### **Chapter 11 Electrofacies in Reservoir Characterization**

**John C. Davis**

**Abstract** Electrofacies are numerical combinations of petrophysical log responses that reflect specific physical and compositional characteristics of a rock interval; they are determined by multivariate procedures that include principal components analysis, cluster analysis, and discriminant analysis. As a demonstration, electrofacies were used to characterize the Amal Formation, the clastic reservoir interval in a giant oil field in Sirte Basin, Libya. Five electrofacies distinguish categories of Amal reservoir rocks, reflecting differences in grain size and intergranular cement. Electrofacies analysis guided the distribution of properties throughout the reservoir model, in spite of the difficulty of characterizing stratigraphic relationships by conventional means.

### **11.1 Introduction**

The primary responsibility for reservoir modeling is in the hands of petroleum engineers, but the most successful reservoir modeling projects have included quantitative input from geologists and geophysicists. However, geologists with the necessary mathematical and computer skills are scarce, so there has been a tendency to rely instead on commercial software that runs factory-set defaults to perform geological and petrophysical modeling, even though statistical software can readily be adapted to perform many of the operations that are useful for geological reservoir modeling. These include statistical analyses of properties derived from well logs, cores and downhole measurements and investigations to determine the best geostatistical parameters for static modeling, evaluating relative effectiveness of seismic attributes, and estimating reservoir fluid properties such as hydraulic flow units. As an example, we will consider the calculation and use of electrofacies in the characterization of a giant clastic reservoir, the Amal field of Libya.

© The Author(s) 2018

J. C. Davis (✉)

Heinemann Oil GmbH, 918 Jersey, Box 353, Baldwin City, KS 66006, USA e-mail: jdavis@h-oil.com; jcdbaldwin@gmail.com

B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_11

### **11.2 The Amal Field of Libya**

The first commercial discoveries of oil in the Sirte Basin of Libya were made in 1958, and in 1959 the first giant field in Libya was found in the Sirte Basin. Five more giant fields were discovered in the same year, including the Amal field discussed here. By the end of the 1960s, the Sirte Basin was established as one of the premier oil provinces of the world (Hallett 2002).

Most reservoirs in the major fields of the Sirte Basin have been in production for 50 years or more and are now nearing depletion. In an effort to extend the lives of fields, the Libyan National Oil Company (NOC) has authorized numerous reservoir studies in the hope that they will disclose previously untapped reserves or suggest improved production strategies. Fortunately, seismic, well, and production information is available for many fields, which permits detailed modeling of reservoirs and the investigation of production alternatives.

The Amal field is located on a wedge-shaped tilted fault block called the Rakb High, one of a series of elongated, subparallel horsts and grabens in the eastern part of the Sirte Basin. The primary reservoir interval is the Amal Formation, a typical transgressive clastic sediment composed of weathered material derived from the underlying basement. Most of the formation is a "tight, hard, quartzose, irregularly feldspathic sandstone" (Roberts 1970). Radiometric studies date the Amal Formation as Cambro-Ordovician to Permian, although a few Triassic fossils have been recovered from lacustrine shales within the formation. Elsewhere in Libya similar transgressive basal sandstones overlying the Hercynian unconformity are called the "Nubian Sandstone" and assigned a Lower Cretaceous age (El-Hawat et al. 1996). The Amal clastics were deposited in continental environments, with some small irregular intervals of possibly lacustrine and shallow marine origin. Thin volcanic sills and flows of Permian age also occur sporadically in the formation, as do local unconformities. The Amal is present everywhere on the Rakb High except at the south end of the uplift where it has been removed by erosion.

### **11.3 Electrofacies Analysis**

"Electrofacies" are unique combinations of petrophysical log responses that reflect specific physical and compositional characteristics of a rock interval cut by a borehole. The term "electrofacies" was coined by Serra and Abbot (1980), who considered electrofacies to be proxies for lithofacies. An important advantage of electrofacies over alternative types of facies classifications of rocks in the subsurface is that electrofacies can be defined solely on the basis of well log responses, without reliance on cores, cuttings or outcrops. Although electrofacies are empirical, they are also objective; no subjective interpretations of sediment genesis or inferences about depositional environments are required.

There is no specific procedure for defining electrofacies. The general requirements are that they be determined from a consistent set of petrophysical log measurements; that the similarities between down-hole intervals are expressed quantitatively from the log responses; that the intervals are consistently divided into subsets that have similar responses; and that the distinctions between subsets are expressed as mathematical functions. Because of the enormous amount of data contained in the log suites from a collection of wells, it is necessary that electrofacies be determined by computer (Kiaei et al. 2015). This introduces the practical requirement that electrofacies be defined by a programmable algorithm.

Many procedures for determining electrofacies have been proposed in the literature (Berteig et al. 1985; Busch et al. 1987; Delfiner et al. 1987; Tetzlaff et al. 1989; Anxionnaz et al. 1990; Hernandez–Martinez et al. 2013; Euzen and Power 2014) and most commercial software packages for subsurface modeling have electrofacies functions. Unfortunately, details about how these functions perform are seldom revealed, and the procedures operate as "black boxes." (Exceptions are the description of Schlumberger's FACIOLOG procedure given by Wolff and Pelissier-Combescure 1982, and the software provided by Lee et al. 2002.) Almost all commercial implementations consist of a combination of principal components analysis, cluster analysis, and discriminant analysis. These underlying methodologies can be duplicated using a multivariate statistical package, which has the advantages of flexibility and transparency, although it is perhaps less convenient for routine electrofacies calculations. Dubois et al. (2007) provide a comparison of alternative statistical methodologies for electrofacies analyses. Perez et al. (2005) have demonstrated that electrofacies are superior to other types of reservoir characterizations such as lithofacies or hydraulic flow units (HFU).

The general definition of "facies" is "the aspect, appearance, and characteristics of a rock unit, usually reflecting the conditions of its origin; especially as differentiating the unit from adjacent or associated units" (Neuendorf et al. 2005). The definition continues to more specialized varieties of facies, noting that "sedimentary facies" consist of a restricted part of a lithostratigraphic body with a unique lithology or fossil content, or a certain environment or mode of origin such as "red-bed facies." A "petrographic facies" is a body of rock of a distinctive lithology, while a "biofacies" contains a unique assemblage of fossil organisms. "Environmental facies" consist of a body of rock formed in a specific environmental setting, such as a "fluvial facies" or a "near-shore facies." The term "facies" may also refer to rocks defined on a paleogeographic or paleotectonic basis, such as a "geosynclinal facies" or a "continental margin facies."

Note that all of these definitions require either information that can only be obtained from direct observation of the rocks themselves (lithologies, fossils), or subjective interpretations about the origins or depositional environments in which the rocks were formed. In contrast, electrofacies are based solely on the "…aspect, appearance, and characteristics…" of petrophysical logs, and not of the rocks which the logs represent. The basic assumption in electrofacies interpretation is that a unique combination of log properties represents a rock that exhibits a unique combination of physical properties—in other words, the rock is unique in terms of its composition and fluid content.

### *11.3.1 Choice of Log Traces for Electrofacies Calculation*

Ideally, there will be a large suite of logs available for calculating electrofacies and the tool responses to be used can be chosen based on resolution and response to properties of primary interest. In practice, especially in areas where drilling and logging has taken place over many years, finding a common set of logs that is available in all (or most) wells severely limits the choice. In the electrofacies study discussed here, only the DT, GR and ILD logs were common to all wells in the field. However, by removing a small number of wells from consideration, the suite of logs could be expanded to include the SN and SP logs.

### *11.3.2 Standardization of Log Traces*

It is essential that the log measurements used in electrofacies calculations be consistent throughout the stratigraphic section in the well being analyzed, and from one well to another. This can be done in a variety of ways. Some commercial programs such as Schlumberger's *Petrel* do this by converting the data into principal component scores and then computing electrofacies from scores rather than from the log data itself. Although principal components were calculated here for display purposes, we prefer to compute electrofacies directly from the original log variables after appropriate transformations.

Log standardization consists of subtracting the mean log response over an interval of interest from every log reading in the interval and dividing the difference by the standard deviation of the response in the interval. This converts the reading into dimensionless units of standard deviation, most of which will range in value from –3 to +3 (Davis 2002). Each log trace is standardized independently of all other log traces in a well, and the traces in each well are standardized independently of all other wells. This (1) removes any effects caused by differences in measurement units (ohm-meters, millivolts, microseconds/ft, etc.). It also ensures (2) that all logs used in the analysis equally influence the classification of the electrofacies because all the logs have the same average value (their means are all 0.0) and their spreads in values are approximately the same (their standard deviations are all equal to 1.0). Furthermore, (3) any differences between wells caused by different hole conditions or different logging parameters are removed. In petrophysical terms, standardization of the log tracks for individual wells can be regarded as an ultimate form of well log normalization.

We can regard the transformed well log data as consisting of a matrix or flat file whose columns contain the standardized well log traces and whose rows are measured depths or elevations in specific wells. Further computations are done treating the row vectors as individual multivariate "objects" to be classified.
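The standardization described above can be sketched in a few lines of Python. This is a minimal illustration, not the software used in the study: the function name `standardize_logs`, the array layout (one row per depth, one column per log) and the numerical values are all assumptions made for the example, and the sample standard deviation (`ddof=1`) is one possible choice.

```python
import numpy as np

def standardize_logs(logs, well_ids):
    """Z-score each log column separately within each well.

    logs     : (n_obs, n_logs) array of raw log readings (hypothetical layout)
    well_ids : (n_obs,) array naming the well to which each row belongs
    """
    logs = np.asarray(logs, dtype=float)
    well_ids = np.asarray(well_ids)
    z = np.empty_like(logs)
    for well in np.unique(well_ids):
        rows = well_ids == well
        mean = logs[rows].mean(axis=0)
        std = logs[rows].std(axis=0, ddof=1)   # sample standard deviation
        z[rows] = (logs[rows] - mean) / std    # assumes no log is constant in a well
    return z

# Two invented wells with three logs each (e.g. GR, DT, ILD) in different units
raw = np.array([[60.0, 70.0, 12.0],
                [80.0, 75.0, 20.0],
                [100.0, 80.0, 40.0],
                [30.0, 60.0, 2.0],
                [50.0, 66.0, 5.0],
                [40.0, 63.0, 8.0]])
wells = np.array(["A", "A", "A", "B", "B", "B"])
z = standardize_logs(raw, wells)
```

Each row of `z` is then one multivariate "object" in the flat file described above, with every column having zero mean and unit standard deviation inside each well.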

### *11.3.3 Estimating the Number of Distinct Electrofacies*

Because electrofacies are defined empirically, the number of different electrofacies is somewhat arbitrary. The number of useful electrofacies is partly dependent on the number of log properties used in their calculation and the joint nature of the statistical distributions of the log measurements. It also reflects the purpose of electrofacies classification and the manner in which the final classification will be evaluated and used. A simple distinction between reservoir and non-reservoir rock may be made with an electrofacies classification of only two classes, while a study for environmental interpretation may require a dozen or more classes.

Because there is a limited number of well logs that measure different physical properties in the example used here, we anticipate that an effective electrofacies interpretation will not involve many facies classes. Determining the appropriate number requires trial-and-error, starting with many classes and reducing the number to eliminate trivial categories that include only a few rare observations, or to combine ill-defined classes that have very similar properties. The same trial-and-error process can be used to evaluate alternative procedures such as different clustering algorithms.

Figure 11.1 is a cross-plot of the first and second principal components of log responses from the Amal Formation. The scatter diagram represents 12,535 well log observations classified into seven electrofacies; each electrofacies category is indicated by a color (1 = red; 2 = green; 3 = blue; 4 = orange; 5 = light blue; 6 = purple; 7 = yellow). Categories 3 and 4 are relatively small and consist of scattered observations located on the periphery of the main cloud of observations; a classification with fewer categories might be better. The classification procedure was repeated with six categories, then with five, and finally with only four. Five electrofacies seemed to be an optimal compromise in which the facies are general enough to include significant thicknesses of intervals, but not so detailed that they defy interpretation (Fig. 11.2). The distribution of observations among the five classes is shown in a principal component scatter plot in Fig. 11.3.
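Principal component scores of the kind plotted in these cross-plots can be obtained from the standardized log matrix with a singular value decomposition. The sketch below is only illustrative: the function name `principal_component_scores` and the random stand-in data are assumptions, and any multivariate statistical package would give equivalent scores (possibly with reversed signs).

```python
import numpy as np

def principal_component_scores(z, n_components=2):
    """Project standardized rows onto the leading principal components."""
    z = np.asarray(z, dtype=float)
    centered = z - z.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
z = rng.standard_normal((200, 5))   # stand-in for five standardized logs
z[:, 1] += 0.9 * z[:, 0]            # make two of the "logs" correlated
scores, explained = principal_component_scores(z)
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by electrofacies label, gives a display of the same type as the figures; `explained` holds the fraction of total variance carried by each plotted component.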

### *11.3.4 Assigning Well Log Intervals to Electrofacies*

There are two basic approaches to the assignment of log intervals to electrofacies, referred to generally as *supervised* and *unsupervised classification*. The first requires prior definition of the facies categories, which is usually done by identifying unique lithologies in cores. The log traces for the corresponding intervals are then used as a training set for discriminant analysis or another classification procedure that yields equations used to discriminate between the facies in uncored intervals. Although this approach has the advantage that interpreting the "meaning" of the electrofacies categories is obvious, it has a severe disadvantage in that cores or other training materials are required. An example of a supervised electrofacies classification is given by Barthelmy (2000), who classified 360,000 feet of log from the Smackover Formation in 364 North American wells, using 47,000 feet of core as training material. In the Amal field, very few cores have been taken and not all the rock types in the Amal Formation have been sampled in a representative manner.

**Fig. 11.1** Cross plot of first two principal component scores of GR, DT, ILD, SN and SP log responses from Amal Formation in 15 wells of the Amal field, Libya. Points are color coded to represent seven electrofacies calculated by k-means cluster analysis

If adequate training materials are not available, the analyst must resort to unsupervised classification. This involves subdividing the set of log measurements into subsets that are as unique as possible in their log characteristics, and as distinct as possible from other subsets. There are many procedures that attempt to achieve this objective—their effectiveness depends on the statistical distributions of the petrophysical logs that are used.

The classification procedure used in this study is *k*-*means clustering*, which assigns each observation (a row vector in the data set) to the "nearest" cluster based on the multidimensional distance between the observation and the cluster centroid. The multivariate Euclidean distance, *d<sub>ij</sub>*, between an observation and a cluster centroid is

$$d_{ij} = \sqrt{\frac{\sum_{p=1}^{q} \left(z_{ip} - \bar{Z}_{jp}\right)^2}{q}}$$

where *z<sub>ip</sub>* is the standardized response of log track *p* at a well depth *i* and *Z̄<sub>jp</sub>* is the average response of log *p* in cluster *j*. There are *q* different standardized log traces per observation.

**Fig. 11.2** Histograms of the number of log readings in each electrofacies class in 15 wells of the Amal field, Libya. **a** Categorized into seven electrofacies classes. **b** Categorized into five electrofacies classes

**Fig. 11.3** Cross plot of first two principal component scores of log responses from Amal Formation in 15 wells of the Amal field, Libya. Points are color coded to represent five electrofacies calculated by k-means cluster analysis

The *k*-means method first selects a set of *k* points called *cluster seeds* as a first guess at the means of the clusters. Each observation is assigned to the nearest seed to form a set of temporary clusters. The seeds are then replaced by the cluster means, the points are reassigned, and the process continues until no further changes occur in the clusters. The *k*-means approach is a special case of a general approach called the *EM algorithm* (Dempster et al. 1977), where *E* stands for *expectation* (the assignment of observations to the closest clusters in this implementation) and *M* stands for *maximization* (the recalculation of the cluster means in this implementation). The algorithm will produce maximum likelihood estimates of the probability that a log reading belongs to a specific electrofacies. The procedure is widely used in computer vision and portfolio management, in addition to electrofacies classification. Fifty-one iterations were required by the *k*-means algorithm to converge on a stable five-cluster configuration of the 12,535 log responses used here.
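The seed, assign and update loop described above can be sketched as follows. This is a bare-bones illustration rather than the implementation used in the study: the function name `kmeans`, the random choice of seeds among the observations and the synthetic two-blob data are assumptions, and production packages add refinements such as multiple restarts.

```python
import numpy as np

def kmeans(z, k, max_iter=100, seed=0):
    """Plain k-means on standardized rows; returns labels, centroids, iterations."""
    z = np.asarray(z, dtype=float)
    rng = np.random.default_rng(seed)
    # Cluster seeds: k distinct observations chosen at random (an assumption)
    centroids = z[rng.choice(len(z), size=k, replace=False)]
    labels = np.full(len(z), -1)
    for iteration in range(1, max_iter + 1):
        # Root-mean-square distance over the q logs, as in the distance formula
        diff = z[:, None, :] - centroids[None, :, :]
        dist = np.sqrt((diff**2).mean(axis=2))
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                        # assignments are stable: converged
        labels = new_labels
        for j in range(k):
            members = z[labels == j]
            if len(members):             # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids, iteration

# Synthetic stand-in for standardized log readings: two well-separated groups
rng = np.random.default_rng(1)
blob_a = rng.normal(loc=-3.0, scale=0.3, size=(50, 3))
blob_b = rng.normal(loc=3.0, scale=0.3, size=(50, 3))
labels, centroids, n_iter = kmeans(np.vstack([blob_a, blob_b]), k=2)
```

Dividing by *q* inside the square root rescales all distances by the same factor, so the assignments are identical to those obtained with the ordinary Euclidean distance.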

### *11.3.5 Converting the Electrofacies Classification into a Prediction Function*

Although the *k*-means clustering algorithm can successfully classify a collection of log responses into an arbitrary number of electrofacies, it does not produce a posterior classifier. That is, it does not create a classification rule or mathematical function that can be used to assign additional log readings to the electrofacies categories it has found. An additional step is necessary.

Canonical discriminant analysis can be used to find a set of linear functions that will separate all possible pairs of electrofacies clusters—in effect, dividing up multivariate space so only one electrofacies occupies each partitioned cell. The computations involve dividing the variance-covariance matrix of the five log properties into components that represent the variation of each observation around the grand mean, the variation of each observation around its electrofacies group mean, and the variation of the electrofacies means around the grand mean. Computational details are given in Davis (2002). Mulhern et al. (1986) discuss the application of discriminant functions to electrofacies determination.

In discriminant analysis, the distance from a log reading to the multivariate mean of the *i*-th electrofacies group is the Mahalanobis distance, *D*<sup>2</sup> , and is computed as

$$D^2 = (z - \bar{Z}_i)^\prime \mathbf{S}^{-1} (z - \bar{Z}_i) = z^\prime \mathbf{S}^{-1} z - 2z^\prime \mathbf{S}^{-1} \bar{Z}_i + \bar{Z}_i^\prime \mathbf{S}^{-1} \bar{Z}_i$$

where **S** is the covariance matrix. The distance is divided into a portion, *dist*[0], that does not vary across groups and a portion that is the Mahalanobis distance of an observation from the centroid of the *i*-th electrofacies, *dist*[*i*]:

$$\begin{aligned} dist[0] &= z^\prime \mathbf{S}^{-1} z \\ dist[i] &= dist[0] - 2z^\prime \mathbf{S}^{-1} \bar{Z}_i + \bar{Z}_i^\prime \mathbf{S}^{-1} \bar{Z}_i \end{aligned}$$

Assuming that each group follows a multivariate normal distribution, the posterior probability that a well log interval belongs to the *i*th electrofacies is

$$\Pr[i] = \frac{\exp\left(-0.5\,dist[i]\right)}{\Pr[0]}$$

where

$$\Pr[0] = \sum_{i} e^{-0.5\,dist[i]}$$

The distances from every log observation to each electrofacies centroid are first calculated, then turned into probabilities. Each observation is then assigned to the electrofacies to which its probability of membership is the highest. Observations from other wells can also be assigned electrofacies by entering their standardized measurements into the distance and probability equations.
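The distance and probability calculation can be sketched as below, using the exp(−0.5 · *dist*) weighting that appears in Pr[0]. The function name `posterior_probabilities` and the toy centroids are assumptions for illustration; the sketch also assumes equal prior probabilities for all electrofacies and takes the covariance matrix **S** as a given input.

```python
import numpy as np

def posterior_probabilities(z, centroids, cov):
    """Posterior membership probabilities from squared Mahalanobis distances.

    Assumes equal priors and a single covariance matrix shared by all groups.
    """
    z = np.atleast_2d(np.asarray(z, dtype=float))
    centroids = np.asarray(centroids, dtype=float)
    s_inv = np.linalg.inv(cov)
    diff = z[:, None, :] - centroids[None, :, :]          # (n_obs, k, q)
    d2 = np.einsum("nkq,qr,nkr->nk", diff, s_inv, diff)   # D^2 to each centroid
    # Subtract the row minimum before exponentiating for numerical stability;
    # the common factor cancels when dividing by the row sum
    w = np.exp(-0.5 * (d2 - d2.min(axis=1, keepdims=True)))
    return w / w.sum(axis=1, keepdims=True), d2

# Two invented electrofacies centroids in a two-log space with identity covariance
centroids = np.array([[0.0, 0.0], [4.0, 0.0]])
cov = np.eye(2)
prob, d2 = posterior_probabilities([[2.0, 0.0], [0.0, 0.0]], centroids, cov)
assigned = prob.argmax(axis=1)   # maximum-probability electrofacies per reading
```

The first reading lies midway between the two centroids and receives equal probabilities; the second coincides with the first centroid and is assigned to it with probability close to one.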

The assignment of individual well log observations to electrofacies by canonical discriminant analysis is not perfect, primarily because of overlapping of the original clusters. This can be evaluated by comparing the original electrofacies assignments from clustering to the results of discrimination. Figure 11.4 shows the first two principal components for 12,535 log readings in the Amal Formation in 15 wells. The points have been color-coded according to the maximum probability assignment of electrofacies by the canonical discriminant function. Compare this illustration to the original electrofacies assignments in Fig. 11.3. Contingency analysis shows that the overall correct classification rate is approximately 89%. Correct classification rates for individual electrofacies groups range from a low of 93.1% to a high of 97.9%.
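A contingency analysis of this kind amounts to cross-tabulating the two sets of labels. The sketch below uses invented labels for ten readings and three classes; the function name `contingency_table` is an assumption, and the numbers bear no relation to the Amal results.

```python
import numpy as np

def contingency_table(cluster_labels, discriminant_labels, k):
    """Rows: k-means electrofacies; columns: discriminant assignments."""
    table = np.zeros((k, k), dtype=int)
    np.add.at(table, (cluster_labels, discriminant_labels), 1)
    return table

# Invented labels for ten log readings and three electrofacies
cluster = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
assigned = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 0])
table = contingency_table(cluster, assigned, k=3)
overall_rate = np.trace(table) / table.sum()           # agreement over all readings
per_class_rate = np.diag(table) / table.sum(axis=1)    # agreement within each class
```

Because the overall rate is an average of the per-class rates weighted by class size, it always lies between the lowest and the highest per-class rate.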

**Fig. 11.4** Cross plot of first two principal component scores of standardized log responses from Amal Formation in 15 wells of the Amal field, Libya. Points are color coded to represent maximum probability assignment into five electrofacies classes

However, the primary motivation for introducing a discrimination step in electrofacies analysis is to create numerical expressions that can be used to classify intervals in wells that were not included in the original clustering. This may be necessary if it is not possible to cluster all observations (that is, all depth intervals of interest in all wells) because of computer or software limitations. (A large oil field may include millions of log measurements, so such limitations may significantly constrain an electrofacies study.) Fortunately, in the Amal study it was possible to perform cluster analyses using all of the data of interest, so a discrimination step could be avoided. This not only simplifies the procedure, but also results in a slight but significant improvement in electrofacies classification.

### **11.4 What Do Amal Electrofacies Mean?**

An empirical interpretation of Amal electrofacies has been made by comparing the electrofacies classifications to core descriptions for a set of wells in which extensive sets of cores were taken. The interpretations are necessarily somewhat ambiguous because of the circumstance mentioned in the preceding paragraph, and because the core descriptions were written by different geologists who may have emphasized different aspects of the rock or who used different definitions of their descriptive terms. The following lithologic descriptions represent an amalgam of the written words assigned to numerous intervals in different wells where the Amal has been given the same electrofacies classification. The lithologic distinction between Amal electrofacies is especially difficult because almost all of the formation is composed of sandstones and conglomerates of varying grain size but similar composition.

### *11.4.1 Lithologic Description of Amal Electrofacies*

Electrofacies 1 = Quartz sandstone with abundant kaolinite cement, traces of chlorite, mica and/or feldspar, very fine to medium grain size, subangular, medium to well sorted.

Electrofacies 2 = Quartz sandstone with kaolinite cement, common biotite, very thin bedded and/or crossbedded, silt to fine grain size, subangular, medium sorted.

Electrofacies 3 = Quartz conglomerate with kaolinite and/or anhydrite cement, very fine to very coarse grain size with large (>1 inch) rounded quartz pebbles, round to subround grains, unsorted. Also, quartz sandstone with silica cement, common biotite and/or hematite, silt to coarse grained, alternating sorted and unsorted layers, round to subround, no visible porosity, hard.

Electrofacies 4 = Quartz sandstone with minor kaolinite cement, traces of chlorite, mica and/or feldspar, silt to medium grain size, subangular to subround, medium sorted.

Electrofacies 5 = Igneous rock, weathered, microcrystalline to acicular, with muscovite mica and/or feldspar phenocrysts.

The lithologies corresponding to Amal electrofacies perhaps can best be understood in terms of two-way variation (Fig. 11.5). Along one axis, the electrofacies represent differences in grain size and sorting; along the other axis the electrofacies reflect the nature of the intergranular cement in the sandstone, which tends to be either kaolinite (occasionally calcite or anhydrite) or silica. Kaolinite probably has resulted from the decay of feldspar grains in what was originally an arkosic sandstone. Silica cement probably is the result of pressure solution of quartz grains and redeposition.

### **11.5 Conclusions**

Electrofacies have proved to be a useful procedure for identifying and distinguishing intervals with similar petrophysical log responses and approximately equivalent lithologies within a formation that is nearly homogeneous in composition and devoid of biostratigraphic indicators or marker beds. Because the Amal Formation was mostly deposited in a terrestrial environment, facies change rapidly both laterally and vertically and conventional lithostratigraphic correlations cannot be made. Electrofacies analysis provides a framework for modeling that can guide the distribution of reservoir properties throughout the model, in spite of the difficulty of characterizing stratigraphic relationships by conventional means. This is one example of the type of contributions that can be made to reservoir modeling by geoscientists using a quantitative approach.

**Acknowledgements** Data from the Amal field were provided by the operator, Harouge Oil Operations of Tripoli, Libya. Electrofacies analyses were part of a much larger study of the reservoir by Heinemann Oil GmbH, Leoben, Austria. The assistance of the HOL staff, especially Stephan Egger, is gratefully acknowledged.

### **References**



### **Chapter 12 Shoreline Extrapolations**

**Jean Serra**

**Abstract** A morphological approach for studying the time variations of coastlines is proposed. It is based on interpolations and forecasts by means of weighted median sets, which allow the shorelines at different times to be averaged. After a first translation-invariant method, two variants are proposed. The first one enhances the space contrasts by multiplying the quench function; the other introduces homotopic constraints for preserving the topology of the shore (gulfs, islands).

**Keywords** Median sets ⋅ Binary interpolation ⋅ Hausdorff distances ⋅ Shoreline ⋅ Time forecasting

### **12.1 Three Problems, One Theoretical Tool**

The following study concerns the movements of lagoon inlets. It extends and develops an experimental study made by N.V. Thao and X. Chen about the Thuan An Inlet Area (Thao and Chen 2005). The predictions proposed by these authors were obtained by averaging over time the successive positions of a complex shoreline, including lagoon inlets, which results in a prediction of the coast line. J. Chaussard showed, in Chaussard (2006), that this prediction fits correctly with later data from Google Earth (see Fig. 12.1).

In Thao and Chen (2005), the authors used a popular way to estimate accretions (Srivastava et al. 2005). Figure 12.2 depicts this semi-manual approach: the shoreline has been discretized into segments which are shifted upwards according to a given accretion law (here the linear law *y* = *ax* + *b*, where *x* stands for the time). Indeed, this is nothing but a sampled version of the dilation of the shoreline by the disc of radius *ax* + *b*. Such a circular dilation of a shoreline turns out to be the simplest expression of its evolution under an accretion process, since it is uniform everywhere and does not take the previous stages of the shoreline into account. As a matter of fact, the notion of a set extrapolation is not straightforward, and depends considerably on the features one wishes to preserve or to emphasize.

J. Serra (✉)

Ecole des Mines de Paris, Paris, France e-mail: jean.serra@cmm.ensmp.fr

© The Author(s) 2018

B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_12

**Fig. 12.1** Left: Lagoon Inlets forecast by N.V. Thao and X. Chen; right: Current Google Earth view of the same area

**Fig. 12.2** Classical semi-manual technique of extrapolation


<sup>1</sup>In the shoreline context, the two words "erosion" and "accretion" refer to the two types of changes depicted in Fig. 12.3. The word "erosion" also appears in the context of mathematical morphology, for naming the operation *⊖* involved in Eq. 12.2. It is pure coincidence.

3. If the shore exhibits small gulfs, islands and lagoon lakes, we may require the extrapolation to preserve their homotopy, i.e. neither to create new islands (new gulfs, new lakes) nor to suppress the existing ones.

The first two questions can be treated within the framework of the median set theory, and the third one reduces to a small variant. Though median elements have been thoroughly studied for interpolation problems, in particular by M. Iwanowski (Iwanowski and Serra 2000), no attention has been paid to their potential for generating averages and extrapolations. We nevertheless believe that median sets are convenient tools for shoreline forecasting, and that they extend directly to numerical functions (however, we shall not treat the numerical extension here, and restrict ourselves to the binary approach).

What follows is an attempt in this direction. After a presentation of the median set, which we adapt to shorelines in Sect. 12.2, we analyze in Sect. 12.3 a series of derived notions, such as weighted median set, quench function and quench stripe, and averages. The heart of the matter is treated in Sect. 12.4, where various laws are proposed for the dynamics of the coast movements. A short section on homotopy preservation precedes the conclusion. All images of coasts which are used below are *simulations*, and have the same digital size of 512 × 320 pixels.

### **12.2 Median Set**

In the literature, the median set appears as an interpolation algorithm in Casas (1996) and in Meyer (1996), and was extended to partitions in Beucher (1998). Its formal definition and its basic properties were given in Serra (1998). Since then, the approach has been developed by several authors (Angulo and Meyer 2009; Charpiat et al. 2006). In what follows, the geographical space is modelled by the Euclidean plane, but the approach applies as well to any metric space, including the digital ones. The model of Euclidean median sets does not concern the *lines* of the shores, but the *whole landsets*, whose boundaries are the shorelines. These landsets, denoted below by *A*<sub>1</sub>, *A*<sub>2</sub>, etc., are depicted for example in Fig. 12.3 left, whereas the shoreline boundaries alone, in another example, are depicted in Fig. 12.5 left. The basic results we need to start with are Definition 1 of a median set and Propositions 1 and 2, drawn from Serra (1998).

Hausdorff distance concerns the class *K*′ of the non-empty compact sets of *R<sup>n</sup>* (here of *R*<sup>2</sup>). It is the mapping *ρ* ∶ *K*′ × *K*′ → *R*<sup>+</sup>

$$\rho(X,\ Y) = \inf \{ \lambda \, : \, X \subseteq Y \oplus \lambda B \, ; \, Y \subseteq X \oplus \lambda B \} \tag{12.1}$$

where *B* designates the unit disc centered at the origin, and where *⊕* and *⊖* designate Minkowski addition (or dilation) and subtraction (or erosion) respectively.
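For finite point sets, the infimum in Eq. (12.1) reduces to the larger of two directed "max of min" distances, which gives a compact numerical check of the definition. The sketch below is an illustration only; the function name `hausdorff_distance` and the sample points are assumptions.

```python
import numpy as np

def hausdorff_distance(x_points, y_points):
    """Hausdorff distance between two finite, non-empty point sets."""
    x = np.asarray(x_points, dtype=float)
    y = np.asarray(y_points, dtype=float)
    # Pairwise Euclidean distances, shape (len(x), len(y))
    d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    x_to_y = d.min(axis=1).max()   # smallest radius for which X lies in Y dilated
    y_to_x = d.min(axis=0).max()   # smallest radius for which Y lies in X dilated
    return max(x_to_y, y_to_x)

X = [(0.0, 0.0), (1.0, 0.0)]
Y = [(0.0, 0.0), (1.0, 0.0), (1.0, 3.0)]
rho = hausdorff_distance(X, Y)
```

Here every point of `X` already belongs to `Y`, so the distance is governed by the point (1, 3) of `Y`, which lies 3 units from the nearest point of `X`.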

Consider now an *ordered* pair of closed sets {*X, Y*}, with *X ⊆ Y*, and such that the numerical value *ρ*(*X, Y*), as given by Eq. (12.1), is finite. Their median element is defined as follows:

**Definition 1** The median element between the two ordered sets *X, Y*∈ *K* ′ , with *X ⊆ Y*, is the compact set *M*(*X, Y*), comprised between *X* and *Y* and whose boundary points are equidistant from *X* and *Y<sup>c</sup>* .

In other words, the boundary ∂*M* of *M* is nothing but the skeleton by zone of influence, or *skiz*, between *X* and *Y<sup>c</sup>*.

**Proposition 1** *The median set between X and Y is obtained by taking the union*

$$M(X,Y) = \bigcup \{ (X \oplus \lambda B) \cap (Y \ominus \lambda B),\ \lambda \ge 0 \} \tag{12.2}$$

*where the λ can be limited to the values smaller than or equal to*

$$\mu = \inf \{ \lambda \; ; \; \lambda \ge 0, \; X \oplus \lambda B \supseteq Y \ominus \lambda B \} \tag{12.3}$$

*and where the equality is reached for at least one point of M.*

*Proof* A point *m* at a distance ≤ *λ* from *X* and ≥ *λ* from *Y<sup>c</sup>* belongs to set (*X ⊕ λB*) ∩ (*Y ⊖ λB*), hence to the set of Eq. (12.2). Conversely, as every point *m* ∈ *M* belongs to at least one term of the union, there exists a *λ* ≥ 0 with *d*(*m, X*) ≤ *λ* and *d*(*m, Y<sup>c</sup>*) ≥ *λ*, which results in Eq. (12.2). As for Eq. (12.3), we observe that for *λ* large enough we have (*X ⊕ λB*) ∪ (*Y<sup>c</sup> ⊕ λB*) = *R*<sup>2</sup> because set *Y* is bounded. These *λ* bring no contribution to set *M*(*X, Y*), since *X ⊕ λB ⊇ Y ⊖ λB*. Finally, for *λ* = *μ*, we obtain a point of the boundary ∂*M* because *X* and *Y* are closed, which achieves the proof.
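The proof suggests a direct digital construction: a pixel of *Y* belongs to the median set when it is no farther from *X* than from the complement of *Y*. The sketch below applies this on a small binary grid with brute-force distances; the function names and the strip-shaped test sets are assumptions, and the brute-force distance computation is only practical for small images (a distance transform would replace it at full image size).

```python
import numpy as np

def distance_to_set(mask):
    """Euclidean distance from every pixel to the nearest True pixel of mask."""
    rows, cols = np.indices(mask.shape)
    pixels = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    targets = pixels[mask.ravel()]            # mask must contain at least one pixel
    d = np.linalg.norm(pixels[:, None, :] - targets[None, :, :], axis=2)
    return d.min(axis=1).reshape(mask.shape)

def median_set(x_mask, y_mask):
    """Pixels of Y that are no farther from X than from the complement of Y."""
    assert not (x_mask & ~y_mask).any(), "X must be included in Y"
    d_x = distance_to_set(x_mask)
    d_yc = distance_to_set(~y_mask)           # Y must not fill the whole grid
    return y_mask & (d_x <= d_yc)

# Land on the left of a 5 x 11 grid: X covers columns 0-2, Y covers columns 0-8
cols = np.arange(11)
X = np.tile(cols <= 2, (5, 1))
Y = np.tile(cols <= 8, (5, 1))
M = median_set(X, Y)
```

In this example the complement of *Y* starts at column 9, so the median shoreline falls after column 5, where the distance to *X* (3 pixels) still does not exceed the distance to *Y<sup>c</sup>* (4 pixels).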

Here is now an instructive property which shows how both Hausdorff distances by dilation and by erosion<sup>2</sup> are involved in the median *M*(*X, Y*) (Serra 1998).

**Proposition 2** *Given X, Y* ∈ *K*′(*R<sup>n</sup>*)*, the median element M*(*X, Y*) *is at Hausdorff dilation distance μ from X and from the closing X* ∙ *μB* = (*X ⊕ μB*) *⊖ μB, and at Hausdorff erosion distance μ from Y and from the opening Y* ∘ *μB* = (*Y ⊖ μB*) *⊕ μB.*

<sup>2</sup>Hausdorff distance for erosion, introduced by the relation

$$\sigma(X, Y) = \inf \{ \lambda \, : \, X \ominus B_{\lambda} \subseteq Y \, ; \, Y \ominus B_{\lambda} \subseteq X \},$$

concerns the subclass $\mathcal{A}$ of *K*′(*E*) of the regular compact sets, i.e. such that $\overline{X^{\circ}} = X$. It is indeed a distance on $\mathcal{A} \times \mathcal{A}$. If *σ*(*X, Y*) = 0, then we have

$$Y \supseteq \bigcup_{\lambda > 0} X \ominus B_{\lambda} = X^{\circ} \implies Y \supseteq \overline{X^{\circ}} = X, \qquad X, Y \in \mathcal{A}$$

and similarly *X ⊇ Y*, hence *X* = *Y* (the other two axioms are proved as for distance *ρ*) (Serra 1998).

**Fig. 12.3** Left: two simulated shore images *A*<sub>1</sub> and *A*<sub>2</sub>. The older is supposed to be *A*<sub>1</sub> (the white one). The zones of accretion from *A*<sub>1</sub> to *A*<sub>2</sub> are in light grey, those of erosion in dark grey; right: the boundary of the median set *M* between *A*<sub>1</sub> and *A*<sub>2</sub>

The Hausdorff distance applies to non-empty compact sets. But clearly, the landsets under study are not empty, and the above assumption that *ρ*(*X, Y*) *<* ∞ amounts to saying that all involved distances are bounded.

### **12.3 Median and Average for Non Ordered Sets**

**Non ordered sets** In general, two successive shores *A*<sup>1</sup> and *A*<sup>2</sup> are not ordered, i.e. their change comprises both erosions and accretion areas. If so, the previous results do not apply to two *A*<sup>1</sup> and *A*<sup>2</sup> directly, but to their intersection *X* = *A*<sup>1</sup> ∩ *A*<sup>2</sup> and their union *Y* = *A*<sup>1</sup> ∪ *A*<sup>2</sup> which are ordered since *X ⊆ Y*. Equation (12.1) of the median element becomes

$$M(A\_1, A\_2) = \bigcup\_{\lambda \ge 0} [(A\_1 \cap A\_2) \oplus \lambda B] \cap [(A\_1 \cup A\_2) \ominus \lambda B] \tag{12.4}$$

Figure 12.3 depicts an example of a median set *M*. One observes that *M* passes through all the points where the two coastlines intersect. This property is general, since these points belong to both *A*<sub>1</sub> ∩ *A*<sub>2</sub> and *A*<sub>1</sub> ∪ *A*<sub>2</sub>.
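The construction can be tried on a toy case. The sketch below is only an illustration under simplifying assumptions that are ours: the two shores are sampled along a one-dimensional transect and stored as finite sets of integer positions, so that the structuring element *λB* reduces to the segment [−*λ*, *λ*]; the intersection is dilated and the union is eroded, as in the median element. The function names are hypothetical.

```python
def dilate(points, radius):
    """Dilation of a set of integers by the segment [-radius, radius]."""
    return {p + d for p in points for d in range(-radius, radius + 1)}


def erode(points, radius):
    """Erosion: keep p only if the whole segment [p - radius, p + radius] lies in the set."""
    return {p for p in points
            if all(p + d in points for d in range(-radius, radius + 1))}


def median_set(a1, a2):
    """Median set between two (possibly non-ordered) finite sets of integers."""
    x = a1 & a2          # intersection, the smaller operand
    y = a1 | a2          # union, the larger operand
    median = set()
    lam = 0
    while True:
        eroded = erode(y, lam)
        if not eroded:   # larger erosions stay empty, nothing more to add
            break
        median |= dilate(x, lam) & eroded
        lam += 1
    return median


a1 = set(range(0, 7))    # positions 0..6
a2 = set(range(4, 15))   # positions 4..14
print(sorted(median_set(a1, a2)))  # [2, 3, 4, 5, 6, 7, 8, 9, 10]
```

In this example the two ends of the median set, 2 and 10, lie halfway between the corresponding ends of the two inputs (0 and 4 on one side, 6 and 14 on the other).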

**Weighted median** Set *M* is said to be *median* because each point of *M* is equidistant from *X* and *Y<sup>c</sup>*, which is a consequence of the same weight given to dilation and erosion in Eq. (12.2). By changing this weight, i.e. by replacing *M* by

$$M\_a(X,Y) = \bigcup\_{\lambda} \{ (X \oplus a\lambda B) \cap (Y \ominus (1-a)\lambda B) \} \tag{12.5}$$

for *a* ∈ [0, 1], we generate another interpolation and, by varying *a*, a series of progressive interpolations from *X* to *Y* (Huttenlocher 1995), which come closer to the set *Y* as *a* increases. When the two shores *A*<sub>1</sub> and *A*<sub>2</sub> are *not* nested in each other, one takes *X* = *A*<sub>1</sub> ∩ *A*<sub>2</sub> and *Y* = *A*<sub>1</sub> ∪ *A*<sub>2</sub> as the two operands of Eq. (12.5). This provides interpolators such as those of Fig. 12.4. Unfortunately, these interpolators are closer to the highest or to the lowest line, regardless of whether these lines are portions of *A*<sub>1</sub> or of *A*<sub>2</sub>. To correct this drawback, one must take the interpolator *M<sub>a</sub>* in the zones where *A*<sub>1</sub> is larger than *A*<sub>2</sub> (for example), and *M*<sub>1−*a*</sub> in the other ones. Denoting by *N<sub>a</sub>*(*A*<sub>1</sub>, *A*<sub>2</sub>) the correct weighted interpolator, we now have

$$N\_a(A\_1, A\_2) = \begin{cases} M\_{1-a}(A\_1, A\_2) & \text{when } A\_1 \setminus A\_2 \neq \emptyset \\ M\_a(A\_1, A\_2) & \text{when } A\_2 \setminus A\_1 \neq \emptyset \end{cases} \tag{12.6}$$

Figure 12.5 depicts such corrected interpolators.

**The physical equation of the phenomenon** Physically speaking, the accretion/erosion process evolves at each instant from the stage it has reached before. It takes some *M<sub>a</sub>*(*X, Y*), with *a* ∈ [0, 1], as its starting point and moves to *M<sub>β</sub>*[*M<sub>a</sub>*(*X, Y*), *Y*], for some value *β* ∈ [0, 1]. The weighted medians *M<sub>a</sub>* do model this evolution because they form a semi-group. By calculating first the set *M<sub>a</sub>*(*X, Y*) between *X* and *Y*, and then the set *M<sub>β</sub>*[*M<sub>a</sub>*(*X, Y*), *Y*] between *M<sub>a</sub>*(*X, Y*) and *Y*, we indeed obtain the same result as by calculating directly *M<sub>γ</sub>*(*X, Y*) for the weight *γ* = *a* + *β*(1 − *a*) = *a* + *β* − *aβ*, i.e.

$$M\_{\beta}[M\_a(X,\ Y),\ Y] = M\_{a+\beta-a\beta}(X,\ Y) \tag{12.7}$$

**Fig. 12.4** Raw weighted median lines

**Fig. 12.5** Left: two shores *A*<sub>1</sub> and *A*<sub>2</sub>, of boundaries ∂*A*<sub>1</sub> and ∂*A*<sub>2</sub>, and their median line ∂*M*<sub>0.5</sub>; right: the same, plus two additional weighted median lines according to Eq. (12.5)

For example, in Fig. 12.5 right, the three median sets correspond to *a* = 0.75, 0.5, and 0.25, and the weighted median *M*<sub>0.75</sub> is *also* the median element between *M*<sub>0.5</sub> and *A*<sub>1</sub> ∪ *A*<sub>2</sub>.
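The composition rule of Eq. (12.7) can be checked by hand in a simplified setting. The following derivation is an illustration under our own assumptions, not part of the original argument: the sets are restricted to a one-dimensional transect, with *X* = (−∞, *x*] and *Y* = (−∞, *y*], *x* ≤ *y*; *B* is the unit segment; and the second operand of the weighted median is eroded, as in the median element. Then

$$X \oplus a\lambda B = (-\infty,\ x + a\lambda], \qquad Y \ominus (1-a)\lambda B = (-\infty,\ y - (1-a)\lambda]$$

The first bound increases with *λ* and the second one decreases, so the union over *λ* of the intersections is reached when the two bounds coincide, i.e. for *λ* = *y* − *x*:

$$M\_a(X,\ Y) = (-\infty,\ m\_a], \qquad m\_a = x + a(y - x)$$

Applying the same formula between *m<sub>a</sub>* and *y* with the weight *β* gives

$$m\_a + \beta(y - m\_a) = x + (a + \beta - a\beta)(y - x)$$

which is the shore position of the weighted median of weight *a* + *β* − *aβ* between *X* and *Y*. With *a* = *β* = 0.5, the resulting weight is 0.75, in agreement with the example of Fig. 12.5.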

**Proposition 3** *Given X, Y* ∈ *K*′(*R<sup>n</sup>*)*, the family* {*M<sub>a</sub>*(*X, Y*)*,* 0 ≤ *a* ≤ 1} *of median elements forms an additive semi-group for the addition a ⊗ β* = *a* + *β* − *aβ.*

*Proof* Clearly, *a ⊗ β* ∈ [0, 1]*, thus Eq.* (12.7) *defines a commutative semi-group. The operation ⊗ is also associative, since*

$$\gamma \otimes (a + \beta - a\beta) = \gamma + a + \beta - a\beta - \gamma a - \gamma \beta + \gamma a \beta$$

*is symmetrical in a, β, γ; therefore ⊗ is an algebraic addition.*
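The algebra behind Proposition 3 can also be checked numerically. The short sketch below is only an illustration (the function name `combine` is ours): it verifies on a few weights that the law stays in [0, 1], is commutative and associative, and that 1 − (*a* ⊗ *β*) = (1 − *a*)(1 − *β*), an identity that makes the associativity immediate.

```python
import itertools
import math


def combine(a, b):
    """Composition law of the weighted medians: a (x) b = a + b - a*b."""
    return a + b - a * b


weights = [0.0, 0.25, 0.5, 0.75, 1.0]
for a, b, c in itertools.product(weights, repeat=3):
    assert 0.0 <= combine(a, b) <= 1.0
    assert math.isclose(combine(a, b), combine(b, a))
    assert math.isclose(combine(combine(a, b), c), combine(a, combine(b, c)))
    # Complements multiply, which explains why the law is associative.
    assert math.isclose(1.0 - combine(a, b), (1.0 - a) * (1.0 - b))

print(combine(0.5, 0.5))  # 0.75
```

The weight 0 is neutral for this law and the weight 1 is absorbing, which matches the fact that *M*<sub>0</sub> returns the first operand and *M*<sub>1</sub> the second.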

**Quench function and quench stripe** As a matter of fact, the median operator provides *two outputs*: on the one hand the (weighted or not) *median set M*, whose contour ∂*M* is the dark middle line in Fig. 12.5 left or Fig. 12.6 left, and on the other hand the *quench function q*, defined on *M*, which gives at each point *z* the radius of the minimum disc hitting the two contours ∂*A*<sub>1</sub> and ∂*A*<sub>2</sub>.

$$q(z) = \inf\left\{ r \; ; \; B\_z(r) \cap \partial A\_1 \neq \emptyset \; and \; B\_z(r) \cap \partial A\_2 \neq \emptyset \right\} \tag{12.8}$$

A few such discs, for the two inputs *A*<sub>1</sub> and *A*<sub>2</sub> of Fig. 12.3 left, are depicted in Fig. 12.6 left, and their union over the whole quench function gives the *quench stripe W*, i.e. the dark grey stripe around the black median line in Fig. 12.6 right, with

$$W = \cup \{ B\_z(q(z)), \; z \in M(A\_1, A\_2) \} \tag{12.9}$$

Note that this dark grey stripe does not reach the edges of the input sets *A*<sub>1</sub> and *A*<sub>2</sub>, but an open version of their union and a closed version of their intersection.
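Equations (12.8) and (12.9) translate directly into code once the contours are discretized. The sketch below assumes, for illustration only, that each contour and the median line are given as finite lists of (x, y) points; under that assumption the smallest disc centred at *z* that hits both contours has a radius equal to the larger of the two point-to-contour distances. The function names are hypothetical.

```python
import math


def distance_to_contour(z, contour):
    """Smallest Euclidean distance from point z to a contour sampled as a list of points."""
    return min(math.hypot(z[0] - p[0], z[1] - p[1]) for p in contour)


def quench(z, contour1, contour2):
    """Radius of the smallest disc centred at z that hits both contours (Eq. 12.8)."""
    return max(distance_to_contour(z, contour1), distance_to_contour(z, contour2))


def in_quench_stripe(p, median_points, contour1, contour2):
    """True if p lies in at least one disc B_z(q(z)), z on the median line (Eq. 12.9)."""
    return any(
        math.hypot(p[0] - z[0], p[1] - z[1]) <= quench(z, contour1, contour2)
        for z in median_points
    )


# Two straight, parallel contours at x = 0 and x = 4; the median line is at x = 2.
contour1 = [(0.0, float(y)) for y in range(-5, 6)]
contour2 = [(4.0, float(y)) for y in range(-5, 6)]
print(quench((2.0, 0.0), contour1, contour2))  # 2.0
```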

**Fig. 12.6** Left: a few maximum discs centered on the median line; right: the dark grey stripe is the union of all maximum discs, or "quench stripe"

**Fig. 12.7** Left: four shores; right in dark, their median line

**Averages** The structure of Eq. (12.7) suggests a technique for extending the median element to more than two input sets. Starting for example from the triplet {*A*<sub>1</sub>, *A*<sub>2</sub>, *A*<sub>3</sub>}, we can calculate *M*<sub>0.5</sub>(*A*<sub>1</sub>, *A*<sub>2</sub>) in a first stage, and then *M*<sub>0.33</sub>[*M*<sub>0.5</sub>(*A*<sub>1</sub>, *A*<sub>2</sub>), *A*<sub>3</sub>]. The resulting median element averages the three inputs, in a median sense. Figure 12.7 depicts an example of such an average for the four inputs {*A*<sub>1</sub>, …, *A*<sub>4</sub>} shown in Fig. 12.7 left (two of them are the sets involved in Fig. 12.5 left). The initial stage consists in calculating *M*<sub>0.5</sub>(*A*<sub>1</sub>, *A*<sub>2</sub>) and *M*<sub>0.5</sub>(*A*<sub>3</sub>, *A*<sub>4</sub>), and the final one in calculating *M*<sub>0.5</sub>[*M*<sub>0.5</sub>(*A*<sub>1</sub>, *A*<sub>2</sub>), *M*<sub>0.5</sub>(*A*<sub>3</sub>, *A*<sub>4</sub>)], a set whose contour is drawn in black in Fig. 12.7 right. This final result is independent of the choice of the sets in the initial stage, and we could just as well start from *M*<sub>0.5</sub>(*A*<sub>1</sub>, *A*<sub>3</sub>) and *M*<sub>0.5</sub>(*A*<sub>2</sub>, *A*<sub>4</sub>).

The averages obtained this way blur the structural features of the shores. Imagine for example that *A*<sub>2</sub>, …, *A<sub>n</sub>* are shifted versions of *A*<sub>1</sub> in the horizontal direction. As *n* increases, the median average contour tends towards a horizontal line: all features (gulfs, capes, etc.) are lost. We meet here the same trouble as in interpolating moving objects, with translation and rotation. In the case of shore movements, the translations are probably less intense, but the problem still remains. Note also that this drawback is the counterpart of the advantage of preserving accretion and erosion zones.
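The choice of weights in this staged averaging can be illustrated on a deliberately reduced model, which is our assumption and not the chapter's setting: each shore is summarised by a single position along a transect, so that the weighted median of weight *a* between positions *x* and *y* sits at *x* + *a*(*y* − *x*). Folding the inputs one by one with the weights 1/2, 1/3, …, 1/*n* then returns their arithmetic mean, which is why the third input above receives a weight close to 0.33. The function names are hypothetical.

```python
def weighted_median_position(x, y, a):
    """Shore position of the weighted median on a transect: x for a = 0, y for a = 1."""
    return x + a * (y - x)


def sequential_average(positions):
    """Fold the inputs one by one, giving the k-th input the weight 1/k."""
    average = positions[0]
    for k, position in enumerate(positions[1:], start=2):
        average = weighted_median_position(average, position, 1.0 / k)
    return average


def pairwise_average(p1, p2, p3, p4):
    """Two-stage scheme used for four inputs: medians of pairs, then median of medians."""
    first = weighted_median_position(p1, p2, 0.5)
    second = weighted_median_position(p3, p4, 0.5)
    return weighted_median_position(first, second, 0.5)
```

In this reduced model, `pairwise_average(1.0, 2.0, 3.0, 6.0)` and `pairwise_average(1.0, 3.0, 2.0, 6.0)` both return the mean 3.0, which mirrors the independence from the initial pairing noted above.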

### **12.4 Extrapolations via the Quench Function**

In this section and the next one, we focus on extrapolation from two shores at most, *A*<sub>1</sub> and *A*<sub>2</sub> say. If a chronological sequence of the coast movements is available, *A*<sub>1</sub> and *A*<sub>2</sub> stand for the last two observations, *A*<sub>2</sub> being the more recent. The principle of the extrapolation consists in two possible changes:


**Fig. 12.8** Two extrapolations of the shoreline of Fig. 12.3; both are centered on *M*0*.*5(*A*1*, A*2); the quench function is multiplied by 2 in the left image and by 3 in the right one

Figure 12.8 depicts two extrapolations where the median element equals *M*<sub>0.5</sub>(*A*<sub>1</sub>, *A*<sub>2</sub>), hence where the two input shores are given the same importance, but where the quench stripe *W* of Eq. (12.9) is replaced by

$$W = \cup \{ B\_z(kq(z)), \ z \in M\_{0.5}(A\_1, A\_2) \}$$

The radius of the disc centered at each point of *M*<sub>0.5</sub>(*A*<sub>1</sub>, *A*<sub>2</sub>) is the quench value multiplied by a factor *k*, with *k* = 2 for Fig. 12.8 left and *k* = 3 for Fig. 12.8 right. We see that, as *k* increases, both accretion and erosion zones are developed. We can also notice that the shape of the cape provokes a bizarre inflation in Fig. 12.8 right.
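The magnified stripe is easy to test point by point once the quench values are known. The sketch below assumes, for illustration, that the median line has been sampled and that each sample *z* comes with its precomputed quench value *q*(*z*); the data layout and the function name are hypothetical.

```python
import math


def in_scaled_stripe(p, quench_samples, k):
    """True if p lies in a disc of radius k * q(z) centred at some median point z.

    quench_samples is a list of ((x, y), q) pairs.
    """
    return any(
        math.hypot(p[0] - z[0], p[1] - z[1]) <= k * q
        for z, q in quench_samples
    )


samples = [((0.0, 0.0), 1.0), ((5.0, 0.0), 0.5)]
for k in (1, 2, 3):
    print(k, in_scaled_stripe((2.5, 0.0), samples, k))
# 1 False
# 2 False
# 3 True
```

Points where *q*(*z*) is large, such as those facing the tip of a cape, grow fastest under this law, which is consistent with the inflation observed for *k* = 3.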

This swelling may be due to the great distance from the median line to the extremity of the cape, as shown in Fig. 12.6 right, so we can try to avoid it by bringing the median line closer to the contour of *A*<sub>2</sub>, which delineates the cape. We then replace the median set *M*<sub>0.5</sub>(*A*<sub>1</sub>, *A*<sub>2</sub>) by *N<sub>a</sub>*(*A*<sub>1</sub>, *A*<sub>2</sub>), in the sense of Eq. (12.6), with *a* = 0.75, so that the quench stripe becomes

$$W = \cup \{ B\_z(kq(z)), \ z \in N\_{0.75}(A\_1, A\_2) \}.$$

The resulting changes are depicted in Fig. 12.9, left for *k* = 3 and right for *k* = 4. By comparing Figs. 12.8 right and 12.9 left, where the quench function is multiplied by the same value *k* = 3, we see that the cape inflates less but, in compensation, the erosion zone has vanished. The erosion can reappear by taking *k* = 4 (Fig. 12.9 right), but then the cape again inflates as strongly as in the previous extrapolation of Fig. 12.8 right.

In fact, transforming a quench function according to pure magnification is probably too poor. One can easily imagine more sophisticated laws such as the two following ones:

1. The median line is slightly moved toward the second contour, by taking *N*<sub>0.66</sub>(*A*<sub>1</sub>, *A*<sub>2</sub>), and the quench stripe *W* is obtained by dilating each point *z* of the median line by the disc of radius 2*q*(*z*) and by the segment *L<sub>a</sub>*(2*q*(*z*)) of length 2*q*(*z*) in the main direction of the cape, which gives

$$W = \cup \{ [B\_z(2q(z)) \oplus L\_a(2q(z))], \ z \in N\_{0.66}(A\_1, A\_2) \}$$

and which is depicted in Fig. 12.10 left. The accretion around the cape now turns out to be more realistic, but the erosion zone has disappeared.

2. The median set *N*<sub>0.66</sub>(*A*<sub>1</sub>, *A*<sub>2</sub>) is left unchanged, and a supplementary trend in the horizontal direction is introduced by dilating each point *z* by the horizontal segment *L*<sub>0</sub>(3*q*(*z*)). To avoid changes that are too fast, the parameters of the two other dilations are divided by 2. The shifting effect of the trend operation appears clearly in Fig. 12.10 right, where the accretion forms a deposit at the east of the cape. Similarly, the directional effect of the erosion holds for west oriented regions.

**Fig. 12.9** Two other extrapolations of the shoreline of Fig. 12.3; both are centered on *N*<sub>0.75</sub>(*A*<sub>1</sub>, *A*<sub>2</sub>); the quench function is multiplied by 3 in the left image and by 4 in the right one

**Fig. 12.10** Two extrapolations of the shoreline of Fig. 12.3, by emphasizing the new capes in the left image, and by introducing an east-west trend in the right one

Unlike the previous models, which are all invariant under rotation of the map, these last two laws, which model marine currents, depend on the North direction (see Fig. 12.10).

### **12.5 Accretion and Homotopy**

It may happen that, for some reason, one wishes to preserve the homotopy of the shore, which excludes the creation, or the suppression, of lakes and islands. Now, by dilating the shore of Fig. 12.10 enough, we risk closing the gulf on the left and generating an internal island. An easy way to protect the gulf as such consists in replacing the dilation w.r.t. the unit disc by a cycle of elementary homotopic thickenings in the eight directions of the square grid, or the six ones in the hexagonal case (Serra 1982). The circular dilation of size *n* becomes the series of *n* thickening cycles. One can see in Fig. 12.11, left and right, the results of two thickenings of sizes 25 and 33 respectively (for a 512 × 320 digital image). The gulf is preserved by a narrow channel, which could be enlarged by modifying the homotopy preservation algorithm. This conceptually simple method is not the only possible one. In Vidal et al. (2005) the authors propose a median set based interpolation that preserves particles by marking them by a homotopic thinning, and translating them during the interpolation process.

### **12.6 Conclusion**

Our purpose was to demonstrate the physical sense of the median set approach and its flexibility. In the first section, we indicated three features to be respected by interpolations. According to the first one, an accretion (resp. erosion) zone must continue to evolve by accretion (resp. erosion). This basic modality is fulfilled by all models of Sect. 12.4. The laws proposed in this section are far from being the only possible ones. In particular, each of the six examples of the section uses the same law for accretion and erosion, which is not at all an obligation. The second feature concerns the role of the past. In the approach of Sect. 12.4, this past reduces to the last two stages: they suffice to determine the starting shoreline, the "gradient", and the location of accretion/erosion (Fig. 12.11).

**Fig. 12.11** Two extrapolations of the shoreline of Fig. 12.3 by homotopic thickenings of sizes 25 (left) and 33 (right)

The third feature was the subject of Sect. 12.5, where a thickening is substituted for the dilation in the extrapolator, in order to preserve homotopy. Indeed, all extrapolation equations, from sections two to four, can be rewritten by replacing the unit disc erosion and dilation by unit cycles of thinnings and thickenings, and the linear dilations by unidirectional thickenings. It would result in a series of algorithms where increasingness is lost (non direct extension to numerical functions) but where topological features are preserved.

Finally, as the weighted median of Eq. (12.4) is an increasing function of its two operands, it extends to numerical functions by means of their subgraphs, and makes it possible to process colour images (Daya Sagar 2007).

**Acknowledgements** I am extremely grateful to Dr B.R. Kiran for his precious help in preparing this chapter.

### **References**




### **Chapter 13 An Introduction to the Spatio-Temporal Analysis of Satellite Remote Sensing Data for Geostatisticians**

**A. F. Militino, M. D. Ugarte and U. Pérez-Goya**

**Abstract** Satellite remote sensing data have been available in meteorology, agriculture, forestry, geology, regional planning, hydrology or natural environment sciences for several decades, because satellites routinely provide high quality images with different temporal and spatial resolutions. Joining, combining or smoothing these images to obtain better quality information is a challenge that is not always properly solved. In this regard, geostatistics, as the set of spatio-temporal stochastic techniques for geo-referenced data, is a very helpful and powerful tool that has not yet been sufficiently explored in this area. Here, we analyze the current use of some of the geostatistical tools in satellite image analysis, and provide an introduction to this subject for potential researchers.

### **13.1 Introduction**

The spatio-temporal analysis of satellite remote sensing data using geostatistical tools is still scarce compared with other kinds of analyses. In this chapter we provide an introduction to this field for geostatisticians, emphasizing the importance of using spatio-temporal stochastic methods in satellite imagery and providing a review of some applications (Sagar and Serra 2010). We explain how to access remote sensing data, and which are the common tools for downloading, pre-processing, analysing, interpolating, smoothing and modeling these data. The chapter comprises six additional sections that give a short account of the state of the art in the analysis of remote sensing data using free statistical software. Particular attention is devoted to the use of geostatistical tools in this subject. Section 13.2 explains the profile and the main features of the most popular satellites. It also includes Sect. 13.2.1, which describes some R packages for importing, analysing, and managing satellite images. Section 13.3 explains how to retrieve two derived variables, the normalized difference vegetation index (NDVI) and the land surface temperature (LST). In Sect. 13.4 some common methods of pre-processing data after downloading satellite images are reviewed. Section 13.5 explains the importance of spatial interpolation in remote sensing data and reviews the most popular interpolation methods. The current state of spatio-temporal geostatistics is reviewed in Sect. 13.6, where an additional subsection describes some R packages for using spatial and spatio-temporal geostatistical techniques with satellite images. The chapter ends with some conclusions in Sect. 13.7.

A. F. Militino (✉) ⋅ M. D. Ugarte ⋅ U. Pérez-Goya
Department of Statistics and O.R., Public University of Navarra (Spain), Pamplona, Spain
e-mail: militino@unavarra.es

U. Pérez-Goya
e-mail: unai.perez@unavarra.es

A. F. Militino ⋅ M. D. Ugarte
InaMat (Institute for Advanced Materials), Pamplona, Spain
e-mail: lola@unavarra.es

### **13.2 Satellite Images**

Satellite images have been available for more than four decades, and over that time there has been a notable improvement in the quality, quantity, and accessibility of these images, making it easier to extract huge amounts of data from all over the Earth. We can retrieve data from the land or the ocean, from the coast or the mountains, and also from the atmosphere, where advanced sensors make it possible to monitor meteorological variables that are crucial for the study of climate change, phenology trends, changes in vegetation or many other environmental processes.

Remote sensing refers to the process of acquiring information from the Earth or the atmosphere using sensors or space shuttle platforms. Remote sensing therefore becomes a crucial necessity when satellite images are analysed and converted into different frames of data that can be managed with specific software. Nowadays, Landsat, MODIS, Sentinel or NOAA are some of the most popular satellite missions among researchers and practitioners of remote sensing because their data are freely accessible. Next, we summarize the main characteristics of these missions:

1. LANDSAT, meaning Land+Satellite, represents the world's longest continuously acquired collection of space-based moderate-resolution land remote sensing data. See GLCF (2017) for details. It has been available since 1972 from six satellites in the Landsat series. These satellites have been a major component of NASA's Earth observation program, with three primary sensors evolving over thirty years: MSS (Multi-spectral Scanner), TM (Thematic Mapper), and ETM+ (Enhanced Thematic Mapper Plus). Landsat supplies high resolution visible and infrared imagery, with thermal imagery, and a panchromatic image also available from the ETM+ sensor. Landsat also provides land cover facility to complement overall project goals of distributing a global, multi-temporal, multi-spectral and multi-resolution range of imagery appropriate for land cover analysis.


Remote sensing data of some of these missions can be accessed via the free statistical software R, publicly accessible in R Core Team (2017).

### *13.2.1 Access and Analysis of Satellite Images with R*

This subsection provides a summary of some R packages that can be used for downloading, importing, accessing, processing, and smoothing remote sensing data from satellite images.

1. dtwSat (Maus et al. 2016) implements the Time-Weighted Dynamic Time Warping (TWDTW) method for land use and land cover mapping using satellite image time series. TWDTW is based on the Dynamic Time Warping technique and it has achieved high accuracy for land use and land cover classification using satellite data.


### **13.3 Derived Variables from Remote Sensing Data**

When a satellite image is accessed, an assorted number of bands are provided. Combining these bands yields different types of remote sensing data. For example, the Normalized Difference Vegetation Index (NDVI) can be extracted by a simple combination of bands. NDVI is an important index that reflects vegetation growth and is closely related to the amount of absorbed photosynthetically active radiation, as indicated by Slayback et al. (2003) and Tucker et al. (2005). It is calculated from the radiometric information obtained for the red (R) and near-infrared (NIR) wavelengths of the electromagnetic spectrum in the following way: *NDVI* = (*NIR* − *R*)∕(*NIR* + *R*) (Rouse Jr et al. 1974). As mentioned in Sobrino and Julien (2011), this parameter is sensitive to the greenness of the observed area, which is closely related to the presence of vegetation. Although the numerical limits of NDVI used for vegetation classification can vary, it is widely accepted that negative NDVI values correspond to water or snow. NDVI values close to zero could correspond to bare soils, yet these soils can show a high variability. Values between 0.2 and 0.5 (approximately) correspond to sparse vegetation, and values between 0.6 and 1.0 to dense vegetation such as that found in temperate and tropical forests or crops at their peak growth stage. Therefore, NDVI provides a very valuable instrument for monitoring crops, vegetation, and forestry, and it is directly calculated in specific images by the aforementioned satellite missions. The left of Fig. 13.1 shows a Sentinel NDVI satellite image of Funes, a village of Navarra (Spain), and the right of the same figure shows the NDVI for the whole region of Navarra.

**Fig. 13.1** (Left) NDVI Sentinel image of Funes village in Navarra, and (Right) NDVI for the whole Navarra (Spain)
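The band combination behind NDVI is simple enough to write out. The following sketch is a plain-Python illustration, not the routine of any particular package: bands are assumed to be nested lists of reflectances, and a pixel where both bands are zero is returned as `None`, which is our own convention for an undefined ratio.

```python
def ndvi(nir, red):
    """NDVI = (NIR - R) / (NIR + R); None when the denominator is zero."""
    denominator = nir + red
    if denominator == 0:
        return None
    return (nir - red) / denominator


def ndvi_image(nir_band, red_band):
    """Apply the index pixel by pixel to two bands of identical shape."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


nir_band = [[0.50, 0.10], [0.30, 0.00]]
red_band = [[0.10, 0.30], [0.30, 0.00]]
image = ndvi_image(nir_band, red_band)
# First row: a vegetated pixel (about 0.67) and a negative value (-0.5);
# second row: a pixel at 0.0 and an undefined pixel (None).
```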

Another important variable derived from satellite images is the land surface temperature (LST), which can be retrieved with different algorithmic procedures. As an example, Sobrino et al. (2004) compare three methods to retrieve the LST from thermal infrared data supplied by band 6 of the Thematic Mapper (TM) sensor onboard the Landsat 5 satellite. The first is based on the radiative transfer equation using in situ radiosounding data. The others are the mono-window algorithm developed by Qin et al. (2001) and the single-channel algorithm developed by Jiménez-Muñoz and Sobrino (2003). Many satellite platforms provide specific images of LST all over the Earth, because it is also a very important variable for many environmental processes. Figure 13.2 shows the daily land surface temperature in Navarra (Spain) on 13 July 2015 from the TERRA satellite.

**Fig. 13.2** Land Surface Temperature of Navarra on 13 July 2015

### **13.4 Pre-processing**

The atmosphere lies between the satellite and the Earth, and its effects on the electromagnetic radiation recorded by the satellite can distort, blur or degrade the images. These effects must be corrected before the image processing. The correction consists of composing several images into a new single one. Different algorithms have been developed in the literature according to the derived variable. The most common method with NDVI is the maximum value composite (MVC) procedure (Holben 1986), which assigns to each pixel the maximum value of its time series across the compositing period. Alternative techniques include using a bidirectional reflectance distribution function (BRDF-C) to select observations and the constraint view angle maximum value composite (CV-MVC) (MODIS 2017). For LST day/night it is common to average the cloud-free pixels over the compositing period (Vancutsem et al. 2010). Nowadays, many composite images can be directly downloaded with different spatial and temporal resolutions. For example, raw daily images can be downloaded from the AQUA or TERRA satellites all over the world, but composite images usually have at least a weekly or bi-weekly temporal resolution.
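The two compositing rules mentioned above can be sketched as follows. This is a minimal illustration under our own conventions, not the operational algorithm of any data provider: images are nested lists of equal shape, and `None` marks a missing or cloud-contaminated pixel.

```python
def maximum_value_composite(images):
    """Pixel-wise maximum over the compositing period; None if the pixel is never observed."""
    rows, cols = len(images[0]), len(images[0][0])
    composite = [[None] * cols for _ in range(rows)]
    for image in images:
        for i in range(rows):
            for j in range(cols):
                value = image[i][j]
                if value is not None and (composite[i][j] is None or value > composite[i][j]):
                    composite[i][j] = value
    return composite


def mean_composite(images):
    """Pixel-wise mean of the available (e.g. cloud-free) values; None if there are none."""
    rows, cols = len(images[0]), len(images[0][0])
    composite = []
    for i in range(rows):
        row = []
        for j in range(cols):
            values = [image[i][j] for image in images if image[i][j] is not None]
            row.append(sum(values) / len(values) if values else None)
        composite.append(row)
    return composite


# Three daily NDVI images of a 1 x 2 scene; the second pixel is cloudy on two days.
days = [[[0.2, None]], [[0.5, None]], [[0.4, 0.1]]]
print(maximum_value_composite(days))  # [[0.5, 0.1]]
```

The maximum is a natural choice for NDVI under the usual assumption that clouds and haze lower the index, whereas averaging is the rule quoted above for LST.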

Spatial and temporal resolutions also differ within and between satellites. High temporal resolution can be useful when tracking seasonal changes in vegetation on continental and global scales, but when downscaling to small regions a higher spatial resolution is needed, frequently at the cost of a lower temporal resolution. At this step, numerical, physical or mechanical analyses solve the image pre-processing. Later, removing the effect of clouds or other atmospheric effects is also required, otherwise remote sensing data can be inaccurate. Sometimes a heavy presence of clouds forces several images to be dropped, but if they are only partially clouded, different approaches for eliminating these effects can be used. Noise reduction in image time series is neither simple nor straightforward, and many alternatives have been provided. For example, the R.HANTS macro of GRASS, SPIRITS, BISE, TIMESAT, GAPFILL or the CACAO methods are widely used. R.HANTS performs a harmonic analysis of time series in order to estimate missing values and identify outliers (Roerink et al. 2000). SPIRITS is a software that processes time series of images (Eerens et al. 2014). It was developed by the PROBA-V data provider and gives four smoothing options, including MEAN (interpolate missing values and apply a running mean filter, RMF) and BISE (Best Index Slope Extraction) (Viovy et al. 1992). TIMESAT uses numerical procedures based on Fourier analysis, Gauss, double logistic or Savitzky-Golay filters (Jönsson and Eklundh 2004). GAPFILL uses quantile regression to produce smoothed images where the effect of the clouds has been reduced. Usually, every software has different requirements with regard to the number of images necessary for smoothing (Atkinson et al. 2012). Finally, the CACAO software (Verger et al. 2013) provides smoothing, gap filling, and characterization of seasonal anomalies in satellite time series.
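To make the idea of "interpolate missing values, then apply a running mean" concrete, here is a generic sketch for the time series of a single pixel. It is not the implementation used by SPIRITS or by any of the tools listed above; the linear interpolation, the treatment of the series ends and the window size are all assumptions made for the example.

```python
def fill_gaps(series):
    """Linearly interpolate None values; gaps at the ends take the nearest valid value."""
    known = [i for i, value in enumerate(series) if value is not None]
    if not known:
        raise ValueError("series has no valid observation")
    filled = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            t = (i - left) / (right - left)
            filled[i] = series[left] + t * (series[right] - series[left])
    return filled


def running_mean(series, half_window=1):
    """Centred moving average; the window shrinks at both ends of the series."""
    n = len(series)
    smoothed = []
    for i in range(n):
        window = series[max(0, i - half_window):min(n, i + half_window + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


pixel_series = [1.0, None, 3.0]          # one cloudy date in the middle
print(running_mean(fill_gaps(pixel_series)))  # [1.5, 2.0, 2.5]
```

Because each pixel is processed on its own, a filter of this kind uses no information from neighbouring pixels.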

All these procedures give composite images that are smoothed versions of the raw images, but very often they are not completely free of noise. Many of the attributes that can be extracted from the combination of satellite image bands are still vulnerable to many atmospheric or electronic accidents. For example, highly reflective surfaces, including snow and clouds, and sun-glint over water bodies may saturate the reflective wavelength bands, with saturation varying spectrally and with the illumination geometry (Roy et al. 2016). Land surface temperature and the normalized difference vegetation index are examples of attributes where these types of errors can be present. Therefore, after pre-processing is done, interpolation and smoothing methods can be very useful for drawing or detecting trend changes, clustering or many other processes on remote sensing data.

### **13.5 Spatial Interpolation**

Interpolation and classification are probably among the most used tools with remote sensing data. Supervised and unsupervised classification of satellite images is an important research area, not only for satellite images but also in big data and data mining, where a great number of algorithmic procedures exist (Benz et al. 2004). Here, we are more interested in interpolation as it is more closely related to geostatistics.

Interpolation has been widely used in environmental sciences. Li and Heap (2011) review more than 50 different spatial interpolation methods that can be grouped into three categories: non-geostatistical methods, geostatistical methods, and combined methods. All of them can be represented as weighted averages of sampled data. Among the non-geostatistical methods the authors find: nearest neighbours, inverse distance weighting, regression models, trend surface analysis, splines and local trend surfaces, thin plate splines, classification, and regression trees. The different versions of simple, ordinary, disjunctive or model-based kriging are among the geostatistical methods. The combined methods include: trend surface analysis combined with kriging, linear mixed models, regression trees combined with kriging or regression kriging.
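Inverse distance weighting is perhaps the simplest instance of such a weighted average of sampled data. The sketch below is a generic textbook version written for illustration; the function name, the data layout and the default power of 2 are assumptions.

```python
import math


def idw(samples, target, power=2.0):
    """Inverse distance weighted prediction at target.

    samples is a list of ((x, y), value) pairs.
    """
    weighted = []
    for (x, y), value in samples:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return value  # the predictor is exact at sampled locations
        weighted.append((1.0 / d ** power, value))
    total = sum(w for w, _ in weighted)
    return sum(w * v for w, v in weighted) / total


samples = [((0.0, 0.0), 0.0), ((2.0, 0.0), 10.0)]
print(idw(samples, (1.0, 0.0)))  # 5.0
```

Here the weights depend only on the distances to the target; kriging, in contrast, derives them from a model of the spatial covariance.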

More recently, Jin and Heap (2014) presented an excellent review of spatial interpolation methods in environmental sciences, introducing 10 methods from the machine learning field. These methods include support vector machines (SVM), random forests (RF), neural networks, neuro-fuzzy networks, boosted decision trees (BDT), the combination of SVM with inverse distance weighting (IDW) or ordinary kriging (OK), the combination of RF with IDW or OK (RFIDW, RFOK), general regression neural network (GRNN), the combination of GRNN with IDW or OK, and the combination of BDT with IDW or OK. Although not all of these methods were developed specifically for remote sensing data, nowadays the majority of them have been implemented in different packages of the free statistical software R, and can be used with satellite images. Many of these methods are ready to use and interpret, but the family of kriging methods, as the core of geostatistics, is preferred and widely used.

### **13.6 Spatio-Temporal Interpolation**

Since the publication of the seminal book *Spatial Autocorrelation* (Cliff and Ord 1973), and later of the books *Spatial Statistics* (Ripley 1981), *Statistics for Spatial Data* (Cressie and Wikle 2015), and *Multivariate Geostatistics* (Wackernagel 1995), there has been a rapid growth of spatial geostatistical methods, as they are essential tools for interpolating meteorological, physical, agricultural or environmental variables in locations where these variables are not observed.

The use of spatial geostatistics with remote sensing data is also widespread, and its procedures are present in many specific software packages for satellite image analysis (Stein et al. 1999). Geostatistical techniques can help to explore and describe the spatial variability, to design optimum sampling schemes, and to increase the estimation accuracy of the variables of interest. These models can be enriched with auxiliary information coming from classified land cover or historical information (Curran and Atkinson 1998). Kriging is the most popular geostatistical method, with several versions such as block kriging, universal kriging, ordinary kriging, regression kriging or indicator kriging. It provides the spatial interpolation of different spatial variables through the use of spatial stochastic models, and it is the best linear unbiased predictor under normality assumptions when using spatially dependent data.
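As a rough illustration of how kriging weights arise, the sketch below solves the ordinary kriging system for a handful of samples in plain Python. It rests on several assumptions that are ours: the covariance model (exponential, with unit sill and range) is taken as known instead of being estimated from a variogram, the sample locations are distinct, and the small linear solver is only meant for toy problems. It does not reproduce the interface of any R package.

```python
import math


def exponential_covariance(h, sill=1.0, range_=1.0):
    """Assumed covariance model C(h) = sill * exp(-h / range_)."""
    return sill * math.exp(-h / range_)


def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (a[r][n] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x


def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])


def ordinary_kriging(samples, target, cov=exponential_covariance):
    """Ordinary kriging prediction at target; samples is a list of ((x, y), value)."""
    n = len(samples)
    coords = [p for p, _ in samples]
    # Covariances between samples, bordered by the unbiasedness constraint (weights sum to 1).
    matrix = [[cov(distance(coords[i], coords[j])) for j in range(n)] + [1.0]
              for i in range(n)]
    matrix.append([1.0] * n + [0.0])
    rhs = [cov(distance(p, target)) for p in coords] + [1.0]
    solution = solve(matrix, rhs)
    weights = solution[:n]  # the last entry is the Lagrange multiplier
    return sum(w * value for w, (_, value) in zip(weights, samples))


samples = [((0.0, 0.0), 1.0), ((2.0, 0.0), 3.0)]
prediction = ordinary_kriging(samples, (1.0, 0.0))  # close to 2.0 by symmetry
```

With two samples placed symmetrically around the target, both weights equal 0.5, and at a sampled location the predictor returns the observed value.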

However, the extension to spatio-temporal geostatistical methods is more complicated. Time series models typically assume regular sampling over time, but the temporal lag operator cannot be easily generalized to the spatial domain, where data are likely to be irregularly sampled (Phaedon and André 1999). The scales of time and space are different, and therefore defining joint spatio-temporal covariance functions is not a trivial task (De Iaco et al. 2002). Recently, Cressie and Wikle (2015) presented the state of the art in this area and explained the difficulties of inverting covariance matrices in spatio-temporal kriging, which becomes problematic without some form of separable models or dimension reduction. Modelling the spatio-temporal dependence is frequently case-specific. Therefore, although the spatio-temporal keyword is abundant in many satellite imagery papers, the use of spatio-temporal stochastic models is scarce. Very often, space-time refers only to descriptive analyses of time series of satellite images where every image is analyzed as a set of separate pixels, i.e., when estimating trends or trend changes, statistical methods for univariate time series are used for every pixel. For example, when completing, reconstructing or predicting the spatial and temporal dynamics of the future NDVI distribution, many papers use a time series of images (Forkel et al. 2013; Tüshaus et al. 2014; Klisch and Atzberger 2016; Wang et al. 2016; Liu et al. 2015; Maselli et al. 2014). These studies include the temporal correlation of individual pixels at different resolutions but ignore the spatial dependence among them.
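A separable model, the simplest of the simplifications mentioned above, writes the joint covariance as the product of a purely spatial and a purely temporal covariance. The sketch below illustrates this with exponential components; the functional forms, the parameter values and the function names are assumptions chosen for the example.

```python
import math


def spatial_cov(h, sill=1.0, range_s=10.0):
    """Assumed spatial covariance for a spatial lag h >= 0."""
    return sill * math.exp(-h / range_s)


def temporal_cov(u, range_t=3.0):
    """Assumed temporal correlation for a time lag u."""
    return math.exp(-abs(u) / range_t)


def separable_cov(h, u):
    """Separable space-time covariance C(h, u) = C_s(h) * C_t(u)."""
    return spatial_cov(h) * temporal_cov(u)


# Under separability the temporal decay is the same at every spatial lag.
ratio_near = separable_cov(1.0, 2.0) / separable_cov(1.0, 0.0)
ratio_far = separable_cov(8.0, 2.0) / separable_cov(8.0, 0.0)
assert math.isclose(ratio_near, ratio_far)
```

For data observed on a complete space-time grid, a separable covariance matrix is the Kronecker product of a spatial and a temporal matrix, so only the two smaller matrices need to be inverted. The price is the restriction shown by the check above: the model cannot represent interaction between the spatial and temporal lags.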

Spatio-temporal stochastic models use the spatial or temporal dependence to estimate optimally local values from sampled data. In satellite images, sampled data can be a huge amount of spatially and temporally dependent pixels, if a sequence of images is involved. We briefly review in what follows some stochastic spatio-temporal models that can be used when analysing remote sensing data.


consists of representing a GF with Matérn covariance function as a Gaussian Markov Random Field (GMRF) through the Stochastic Partial Differential Equations (SPDE) approach. Then, the Integrated Nested Laplace Approximation (INLA) algorithm is proposed as an alternative to MCMC methods, giving rise to additional computational advantages (Rue et al. 2009).


### *13.6.1 Geostatistical R Packages*

In this section we briefly describe some of the most useful R packages for geostatistical analysis, including spatial and spatio-temporal interpolation in satellite imagery.

1. FRK (Cressie and Johannesson 2008) stands for fixed rank kriging; it is a tool for spatial and spatio-temporal modelling and prediction with large datasets.


### **13.7 Conclusions**

Multitemporal Earth observation satellites have developed considerably since the seventies, and along with the free availability of millions of satellite images, the number of publications applying geostatistical techniques to remote sensing data has increased rapidly. Unfortunately, not all published papers deriving, analysing or monitoring spatio-temporal evolutions, trends or changes are necessarily geostatistical papers, because they do not really use spatio-temporal stochastic models. These models are still scarce in remote sensing because many of them are computationally very intensive, or because they are not as broadly applicable as spatial models are. The solutions found in the literature are well fitted to specific problems, but they cannot always be transferred to other applications. The use of time series analysis in remote sensing opens a great window of opportunities for monitoring, smoothing, and detecting changes in large series of satellite images, but there are still many remote sensing papers ignoring the spatial dependence when analysing time series of images (Ban 2016). Instead, a huge discretization of the problem is presented where time series of pixels are treated as spatially independent.

Nowadays, the upcoming opportunities for geostatisticians in remote sensing are based not on the separate use of spatial models and time series, but on the use of spatial, temporal, or spatio-temporal stochastic models embedding both types of dependence when necessary. Moreover, a single free statistical software environment like R is a powerful tool for downloading, importing, accessing, exploring and analysing remote sensing data, and for running advanced statistical models on them, within one workflow.

**Acknowledgements** This research was supported by the Spanish Ministry of Economy, Industry and Competitiveness (Project MTM2017-82553-R), the Government of Navarra (Project PI015, 2016 and Project PI043 2017), and by the Fundación Caja Navarra-UNED Pamplona (2016 and 2017).

### **References**


MODIS (2017) https://modis.gsfc.nasa.gov/about/



### **Chapter 14 Flint Drinking Water Crisis: A First Attempt to Model Geostatistically the Space-Time Distribution of Water Lead Levels**

### **Pierre Goovaerts**

**Abstract** The drinking water contamination crisis in Flint, Michigan has attracted national attention since extreme levels of lead were recorded following a switch in water supply that resulted in water with high chloride and no corrosion inhibitor flowing through the aging Flint water distribution system. Since Flint returned to its original source of drinking water on October 16, 2015, the State has conducted eleven bi-weekly sampling rounds, resulting in the collection of 4,120 water samples at 819 "sentinel" sites. This chapter describes the first geostatistical analysis of these data and illustrates the multiple challenges associated with modeling the space-time distribution of water lead levels across the city. Issues include sampling bias and the large nugget effect and short range of spatial autocorrelation displayed by the semivariogram. Temporal trends were modeled using linear regression with service line material, house age, poverty level, and their interaction with census tracts as independent variables. Residuals were then interpolated using kriging with three types of non-separable space-time covariance models. Cross-validation demonstrated the limited benefit of accounting for secondary information in trend models and the poor quality of predictions at unsampled sites caused by substantial fluctuations over a few hundred meters. The main benefit is to fill gaps in sampled time series for which the generalized product-sum and sum-metric models outperformed the metric model that ignores the greater variation across space relative to time (zonal anisotropy). Future research should incorporate the large database assembled through voluntary sampling as close to 20,000 data, albeit collected under non-uniform conditions, are available at a much greater sampling density.

© The Author(s) 2018

P. Goovaerts (✉)

BioMedware, Inc, 11487 Highland Hills Drive, Jerome, MI 49249, USA e-mail: goovaerts@biomedware.com

B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_14

### **14.1 Introduction**

The drinking water contamination crisis in Flint, Michigan has attracted national attention since extreme levels of lead were recorded in local water supplies and the percentage of children with elevated blood lead levels (BLL) increased in neighborhoods with the highest water lead levels (WLL). Problems started when the City of Flint, Michigan adopted the cost-saving decision of drawing and treating water from the Flint River instead of relying on the Detroit Water and Sewerage Department's system (DWSD) for its public water supply. A few months later, in December 2014, water samples showed elevated levels of trihalomethanes (THMs), a disinfection byproduct of chlorine, as well as high levels of lead and copper. A public health emergency was declared and residents were told to avoid drinking the water until it was tested or approved water filters were installed. In July 2015, public concerns were raised that lead and copper were being leached by chlorine-induced corrosion in the underground lead service lines and home plumbing fixtures as a result of not using corrosion control treatment (CCT). In August and September 2015, 16.6% of the 271 water samples collected by a Virginia Tech team were found to exceed the EPA action level of 15 μg/L (ATSDR 2010). In September and October 2015, elevated childhood blood lead levels were confirmed and an emergency response was initiated (Hanna-Attisha et al. 2016), leading the city to switch back to the DWSD water supply on October 16, 2015.

Starting in February 2016, samples were collected bi-weekly at more than 600 sentinel sites chosen by the EPA and MDEQ (Michigan Department of Environmental Quality) across the city to determine the general health of the distribution system and to track changes in lead concentrations over time (Flint Safe Drinking Water Task Force 2016). After five rounds of sentinel sampling, a new sentinel program called "Extended Sentinel Site Program" started in June 2016, targeting specifically sites with high WLL during previous rounds or located in the highest-risk areas. Six additional sampling rounds were conducted for this smaller network including fewer than 200 sites. Overall these 11 sampling rounds resulted in the collection of 4,120 data at 819 different sites over a 40-week time period. This State-controlled monitoring program was supplemented by a voluntary or homeowner-driven sampling whereby concerned citizens received a testing kit and conducted sampling on their own (Goovaerts 2017a, b). Despite the larger size of this database (18,760 samples collected over 53 weeks at 10,341 sites), its heterogeneity and lack of systematic sampling across time prohibited its use in the present space-time analysis.

Except for a few graphs and location maps, the database assembled by the City of Flint and made available online has not undergone any rigorous statistical treatment by State employees and only a few studies have been published so far. Using a data-driven approach Abernethy et al. (2016) developed an ensemble of predictive models (e.g., random forest, logistic regression, linear discriminant analysis) to assess the risk of lead contamination in individual homes and neighborhoods in Flint. They trained these models using a wide range of data sources, including residential water tests, historical records, and city infrastructure data. Their analysis however ignored the spatial correlation among data and did not include a temporal component. A time trend analysis was conducted by Goovaerts (2017a) who used joinpoint regression to model time series of lead levels collected by the state-controlled and voluntary sampling programs. This analysis carried out at the city and ward levels still ignored the spatial correlation among data and did not provide any tax parcel-based prediction. A space-time analysis of these data should however provide important information to identify residences where high levels of lead are expected. It would also support any assessment of past and current lead exposures among the population at risk, particularly pregnant women and children.

Geostatistical techniques have been routinely used to analyze and map the spatial variability of soil and sediment lead concentrations (Goovaerts et al. 1997; Cattle et al. 2002; Solt et al. 2015), yet their application to lead in drinking water is far less common and mainly concerns groundwater quality (Siddique et al. 2012). A recent study (Wang et al. 2014) applied geographic information systems (GIS) and a hydraulic model of distribution systems to test the influences of pipe material, pipe age, water age, and other water quality parameters on lead/copper leaching in Raleigh (NC). In Symanski et al. (2004), mixed effect models were used to assess spatial fluctuations, temporal variability, and errors due to sampling and analysis for levels of disinfection by-products in water samples collected in households within the same distribution system. To the author's knowledge, the present study is however the first application of geostatistics to lead in drinking water within a distribution system.

This chapter describes a new methodology to predict lead level in tap water, accounting for WLL measurements collected in neighboring houses, housing characteristics (e.g., age of the house or presence of lead pipes), and temporal trends (e.g., decline since return to pre-crisis source of drinking water). Linear regression was used to model temporal trends at sentinel sites, accounting for the composition of service line (SL), construction year, poverty level, and census tracts as covariates. Cross-validation analysis allowed one to assess the benefit of this approach and compare the results obtained using three different types of space-time covariance models. Both the cases of predicting unsampled times at monitored locations (i.e., filling gaps in time series) and making predictions at unsampled locations were investigated.

### **14.2 Materials and Methods**

### *14.2.1 Datasets*

4,150 WLL measurements recorded over the period 2/20/2016-11/20/2016 were downloaded from http://www.michigan.gov/flintwater (residential testing results).

**Table 14.1** Datasets available for the space-time analysis: 4,120 water lead levels measured over 11 sampling rounds. Statistics include the number of data available, the sampling period, the percentage of WLL above 15 μg/L, the mean of logtransformed concentrations, and the composition of service line that was recorded for each sentinel site (three main categories besides plastic, unknown, and other)


Data were then allocated to an individual tax parcel unit on the basis of their postal address. Data with incomplete address (two samples) or duplicates (e.g., samples taken from two different faucets on the same day in the same house) were discarded, leading to a total of 4,120 samples collected at 819 different sites; see Table 14.1. Because of their strongly positively skewed distribution (concentrations range from 0 to 5,986 μg/L) and large proportion of zero values (34.6%), data were transformed using the following formula: $\log_{10}(z + 1)$.
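The transform can be illustrated with a minimal Python sketch. The function names `log_transform` and `back_transform` are hypothetical and are not part of the original analysis; the three example values are the zero value, the EPA action level, and the maximum concentration quoted above.

```python
import numpy as np

def log_transform(wll_ug_per_l):
    """Return log10(z + 1) for water lead levels z given in ug/L."""
    z = np.asarray(wll_ug_per_l, dtype=float)
    return np.log10(z + 1.0)

def back_transform(y):
    """Invert the transform: z = 10**y - 1."""
    return np.power(10.0, np.asarray(y, dtype=float)) - 1.0

# Zero, the EPA action level, and the largest concentration quoted in the text.
values = np.array([0.0, 15.0, 5986.0])
transformed = log_transform(values)
```

Adding 1 before taking the logarithm keeps the zero values (34.6% of the data) defined, since they map to exactly 0 on the transformed scale.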

Sentinel sites were initially selected from a pool of 1,951 volunteer sites identified during door-to-door water distribution; in particular it included all 156 sites with lead or lead combination service lines according to City records. Other sites were added according to several criteria: (i) spatial distribution to ensure coverage of all nine City wards, (ii) measurements of high blood lead levels (Hanna-Attisha et al. 2016), and (iii) environmental justice considerations (e.g. presence of houses with lead-based paint, minority population, and lower socio-economic households). This initial set evolved between sampling rounds as some residents stopped participating, while others asked to be included in the network (Goovaerts 2017b), which explains the fluctuation in the number of sampled sites during the first five rounds S1-S5: 607–621 (Table 14.1). Fewer sites (149–178) were then part of the "Extended Sentinel Site Program". Table 14.2 indicates that only 41 sites were sampled in all 11 rounds, while 80% of time series included five observations or fewer.

**Table 14.2** Statistics computed for time series of different lengths: number of sentinel sites, percentage of WLL above 15 μg/L, the mean of log-transformed concentrations, and the composition of service line

Each house selected to be part of the sentinel network was visited by a licensed plumber who classified the material of the service line coming into the home (i.e., customer-side service line) into six categories: lead, galvanized, copper, plastic, other, and unknown. Galvanized refers to iron pipe with a protective "galvanized" surface coating composed of zinc, lead, and cadmium, and therefore can be a long-term source of lead (Clark et al. 2015). The term "unknown" was used whenever the SL material could not be confirmed because, for example, the line was behind a wall or way back in a crawl space.

City records were the only source of service line data available for the majority of 56,039 tax parcels which were not part of the sentinel sampling program. These records are however inaccurate and lead to the over-identification of lead SLs, likely because old records were not updated as these lines were being replaced (Goovaerts 2017c). The same author found that construction year was a good predictor of service line material: galvanized lines were mostly found in pre-1934 houses, while the frequency of lead service lines (LSLs) peaked for houses built around World War II. This information was combined with field inspection data and city records to predict by indicator kriging the likelihood that a home has lead or galvanized SL (Goovaerts 2017c).

Besides service lines, lead in drinking water mainly comes from lead-based solder and lead-containing plumbing fixtures (Lee et al. 1989; Cartier et al. 2011). Plumbing material is usually related to the installation year of a plumbing system, which can be approximated by the year of construction. For example, most faucets purchased prior to 1997 were made of brass or chrome-plated brass containing up to 8 percent lead (Rabin 2008). Construction year was retrieved from the 2016 Parcels GIS layer. The attribute "Year\_built" was missing for 20,372 parcels and was estimated by ordinary kriging (Goovaerts 1997) with a mean absolute error of prediction of 6.43 years. Based on its relationship to water lead levels (Goovaerts 2017a), construction year was discretized into three classes: pre-1940, 1940–1959, and post-1959.

Poor workmanship as well as lack of regular maintenance can also lead to more corrosion and leaching, and the presence of lead particulates, such as disintegrating brass or detaching pieces of old solder (Wang et al. 2014). Socio-economic status was here assessed using 2015 ACS (American Community Survey) 5-year estimates of the percentage of the block group population living in households where the income is less than or equal to twice the federal "poverty level".

There are many other variables known to influence lead in drinking water. For example, longer water age (i.e., water travel time between the treatment plant and home plumbing system) can decrease the effectiveness of corrosion control, increasing leaching and water lead levels (US EPA 2002; Wang et al. 2014). This information was however unavailable for this study.

### *14.2.2 Space-Time Kriging and Covariance Models*

Let $z(\mathbf{u}_\alpha; t)$ denote the water lead level recorded at time $t$ at sentinel site $\alpha$, georeferenced by the geographical coordinates $\mathbf{u}_\alpha = (x_\alpha, y_\alpha)$ of the corresponding tax parcel centroid. Prediction of the z-value at unsampled time $t_0$ and location $\mathbf{u}_0$ was conducted using the following kriging estimator:

$$z^*(\mathbf{u}_0; t_0) = \sum_{t=t_0-\Delta t}^{t_0+\Delta t} \sum_{\alpha=1}^{n(t)} \lambda_{\alpha t} \times z(\mathbf{u}_\alpha; t) \tag{14.1}$$

where $n(t)$ is the number of observations recorded at time $t$, within the time window $2\Delta t$, that were retained for estimation. The weights $\lambda_{\alpha t}$ are the solution of the following space-time (ST) kriging system:

$$\sum_{t=t_0-\Delta t}^{t_0+\Delta t} \sum_{\alpha=1}^{n(t)} \lambda_{\alpha t}\, C(\mathbf{u}_\alpha - \mathbf{u}_\beta; t - t') + \mu = C(\mathbf{u}_0 - \mathbf{u}_\beta; t_0 - t') \quad \beta = 1, \cdots, n(t')$$

$$\sum_{t=t_0-\Delta t}^{t_0+\Delta t} \sum_{\alpha=1}^{n(t)} \lambda_{\alpha t} = 1 \tag{14.2}$$

The parameter $\mu$ is a Lagrange multiplier accounting for the constraint on the weights. The term $C(\mathbf{u}_\alpha - \mathbf{u}_\beta; t - t')$ is the ST covariance between any two observations recorded at locations $\mathbf{u}_\alpha$ and $\mathbf{u}_\beta$ at times $t$ and $t'$, respectively. Euclidean distances were used here since most lead in drinking water comes from premise plumbing materials and service lines instead of being transported through water mains (Del Toral et al. 2013; EET Inc. 2015).
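The estimator and kriging system above can be sketched numerically as follows. This is a minimal illustration, not the implementation used in the chapter: the function `st_ordinary_kriging`, the covariance `demo_cov` (a separable exponential chosen only because it is easy to write), and all coordinates, times and values are hypothetical. The sketch also assumes the caller has already restricted the data to the time window $2\Delta t$.

```python
import numpy as np

def st_ordinary_kriging(coords, times, values, u0, t0, cov):
    """Solve the ST ordinary kriging system and return
    (estimate, weights, Lagrange multiplier).

    cov(h, tau) returns the ST covariance for spatial lag h >= 0 and
    temporal lag tau >= 0; it must accept NumPy arrays.
    """
    coords = np.asarray(coords, dtype=float)
    times = np.asarray(times, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)

    # Pairwise Euclidean spatial lags and absolute temporal lags.
    h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    tau = np.abs(times[:, None] - times[None, :])

    # Data-to-data covariances bordered by the unit-sum constraint.
    lhs = np.ones((n + 1, n + 1))
    lhs[:n, :n] = cov(h, tau)
    lhs[n, n] = 0.0

    # Data-to-target covariances, then 1 for the constraint row.
    h0 = np.linalg.norm(coords - np.asarray(u0, dtype=float), axis=1)
    tau0 = np.abs(times - t0)
    rhs = np.append(cov(h0, tau0), 1.0)

    solution = np.linalg.solve(lhs, rhs)
    weights, mu = solution[:n], solution[n]
    return float(weights @ values), weights, float(mu)

def demo_cov(h, tau):
    # Hypothetical separable exponential covariance (metres, days).
    return np.exp(-np.asarray(h) / 500.0) * np.exp(-np.asarray(tau) / 30.0)

coords = np.array([[0.0, 0.0], [300.0, 0.0], [0.0, 400.0], [300.0, 400.0]])
times = np.array([0.0, 14.0, 28.0, 14.0])
values = np.array([0.3, 0.9, 0.5, 0.7])   # made-up log10(z + 1) values

estimate, weights, mu = st_ordinary_kriging(
    coords, times, values, u0=(150.0, 200.0), t0=14.0, cov=demo_cov)
```

The last row of the linear system enforces that the weights sum to one. When the target coincides with an observation in both space and time, and the covariance has no discontinuity at the origin, the estimate reproduces that observation.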

One challenge associated with the application of ST kriging is the choice of a ST covariance model within the ever growing class of models (Montero et al. 2015). The following three non-separable ST covariance models were compared in the present study:

• The generalized product-sum model (De Iaco et al. 2002):

$$C(h,\tau) = k_1 C_s(h) + k_2 C_t(\tau) + k_3 C_s(h) C_t(\tau) \tag{14.3}$$

where $k_1$, $k_2$, and $k_3$ are non-negative (strictly positive for $k_3$) coefficients estimated from the sills of the spatial, temporal, and spatio-temporal semivariograms (De Cesare et al. 2002).

• The metric model (Dimitrakopoulos and Luo 1994):

$$C(h,\tau) = C_{st} \left( \sqrt{\left(\frac{h}{a_s}\right)^2 + \left(\frac{\tau}{a_t}\right)^2} \right) \tag{14.4}$$

where a normalized space-time distance measure is created by rescaling the spatial and temporal lags, $h$ and $\tau$, by the ranges of the spatial and temporal semivariograms, $a_s$ and $a_t$ (case of geometric anisotropy).

• The sum-metric model (Heuvelink and Griffith 2010):

$$C(h,\tau) = C_s(h) + C_t(\tau) + C_{st} \left( \sqrt{\left(\frac{h}{a_s}\right)^2 + \left(\frac{\tau}{a_t}\right)^2} \right) \tag{14.5}$$

This model combines characteristics of the two previous models: (i) sum of spatial and temporal covariances allowing for the presence of zonal anisotropies (i.e., semivariogram sills are not the same in all directions), and (ii) a metric ST model for the residual variability (geometric anisotropy).
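The three models can be written as short Python functions, which makes their structural differences easy to compare. This is only a sketch: the exponential basic structure `exp_cov` and all sill values are hypothetical choices made for illustration, while the ranges of 2,350 m and 110 days simply reuse the longer ranges reported later in the chapter.

```python
import numpy as np

def exp_cov(lag, sill, practical_range):
    """Exponential covariance, used here as an illustrative basic structure."""
    return sill * np.exp(-3.0 * np.asarray(lag, dtype=float) / practical_range)

def product_sum(h, tau, cs, ct, k1, k2, k3):
    """Generalized product-sum model (Eq. 14.3)."""
    return k1 * cs(h) + k2 * ct(tau) + k3 * cs(h) * ct(tau)

def metric(h, tau, cst, a_s, a_t):
    """Metric model (Eq. 14.4): one covariance of a rescaled ST distance."""
    dist = np.hypot(np.asarray(h, dtype=float) / a_s,
                    np.asarray(tau, dtype=float) / a_t)
    return cst(dist)

def sum_metric(h, tau, cs, ct, cst, a_s, a_t):
    """Sum-metric model (Eq. 14.5)."""
    return cs(h) + ct(tau) + metric(h, tau, cst, a_s, a_t)

# Hypothetical sills; ranges in metres (space) and days (time).
A_S, A_T = 2350.0, 110.0
cs = lambda h: exp_cov(h, 0.3, A_S)
ct = lambda tau: exp_cov(tau, 0.1, A_T)
cst = lambda d: exp_cov(d, 0.2, 1.0)   # range 1 on the normalized distance

c_ps = product_sum(0.0, 0.0, cs, ct, k1=1.0, k2=1.0, k3=1.0)
c_m = metric(A_S, 0.0, cst, A_S, A_T)
c_sm = sum_metric(0.0, 0.0, cs, ct, cst, A_S, A_T)
```

In the sum-metric sketch the covariance left at very large spatial lags (with zero temporal lag) tends to the temporal sill, and vice versa, so the semivariogram sills differ between directions. The metric model alone cannot reproduce this because it has a single sill.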

Two other classes of non-separable ST covariance models, Cressie-Huang model (Cressie and Huang 1999) and Gneiting models (Gneiting 2002), were not considered because: (1) the fitting of these models needs a complex iterative parameter optimization technique (De Iaco 2010), whereas the three selected models can be fitted using straightforward techniques similar to those already used for spatial-only and temporal-only semivariograms, and (2) recent studies (Guo et al. 2015) indicate that these two more complex models provide similar fits to experimental ST semivariograms and comparable prediction accuracy as the product-sum model, confirming previous findings (De Iaco 2010).

The main difficulty in the practical implementation of the product-sum and sum-metric models is the inference of the sill of the ST semivariogram model, $C_{st}(0)$, which is most often estimated visually from the 3D plot of the experimental ST semivariogram $\hat{\gamma}_{st}(h, \tau)$ (e.g., De Cesare et al. 2002; Heuvelink and Griffith 2010). In order to make the fitting procedure more user-friendly, the space-time sill $C_{st}(0)$ was here computed as the following weighted average of experimental space-time semivariogram values:

$$C_{st}(0) = \frac{1}{\sum_{h} \sum_{\tau} w_{h,\tau}} \sum_{h} \sum_{\tau} w_{h,\tau}\, \hat{\gamma}_{st}(h,\tau) \quad \text{if } \hat{\gamma}_{st}(h,\tau) \ge g_c \tag{14.6}$$

where the weight $w_{h,\tau}$ is the number of data pairs falling into the class of spatial and temporal lags $(h, \tau)$. Only the classes where the ST semivariogram values exceed a critical sill $g_c$, defined as the maximum of the spatial and temporal sills, were used.
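The weighted average of Eq. 14.6 amounts to a few array operations. The sketch below is illustrative only: the function name `st_sill` and the small 2 × 2 grid of semivariogram values and pair counts are hypothetical.

```python
import numpy as np

def st_sill(gamma_st, n_pairs, spatial_sill, temporal_sill):
    """Pair-weighted mean of the experimental ST semivariogram values that
    reach the critical sill g_c = max(spatial sill, temporal sill)."""
    gamma_st = np.asarray(gamma_st, dtype=float)
    n_pairs = np.asarray(n_pairs, dtype=float)
    g_c = max(spatial_sill, temporal_sill)
    keep = gamma_st >= g_c                      # condition of Eq. 14.6
    if not keep.any():
        raise ValueError("no lag class reaches the critical sill")
    return float(np.sum(n_pairs[keep] * gamma_st[keep]) / np.sum(n_pairs[keep]))

# Hypothetical experimental values: rows are spatial lag classes,
# columns are temporal lag classes.
gamma = [[0.10, 0.20],
         [0.30, 0.40]]
pairs = [[100, 50],
         [200, 100]]
sill = st_sill(gamma, pairs, spatial_sill=0.25, temporal_sill=0.15)
```

With these made-up numbers only the two classes in the second row pass the threshold of 0.25, so the result is (200 × 0.30 + 100 × 0.40) / 300 = 1/3.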

### *14.2.3 Accounting for Secondary Information*

Lead service lines are widely considered the main source of lead in drinking water (Lee et al. 1989; Clark et al. 2015). Another culprit is lead fixtures and pipes present within old houses (premises plumbing), and poverty can compound the problems through the lack of maintenance. Goovaerts (2017a) also found that temporal trends can vary greatly across the city. This secondary information was here incorporated in the definition of a stochastic trend model $M(\mathbf{u}; t)$, leading to the following decomposition of the space-time random function (RF) (Kyriakidis and Journel 1999):

$$Z(\mathbf{u};t) = M(\mathbf{u};t) + R(\mathbf{u};t) \tag{14.7}$$

where $M(\mathbf{u}; t)$ is a nonstationary spatiotemporal RF modeling the space-time distribution of the mean process, with $E[M(\mathbf{u}; t)] = m(\mathbf{u}; t)$, and $R(\mathbf{u}; t)$ is a zero mean stationary spatiotemporal RF modeling space-time fluctuations around $M(\mathbf{u}; t)$.

The trend component at each sentinel site $\mathbf{u}_\alpha$ was fitted using a linear model including six fixed factors: presence/absence of LSL, presence/absence of galvanized service line (GSL), time since first sample was collected (TIME), poverty level (POV), house age (AGE), and census tract (CT). The model takes the following form:

$$\begin{aligned} M(u,t) = LSL(u) \times TIME + CT(u) \times TIME + LSL(u) \times CT(u) \\ + GSL(u) \times CT(u) + AGE(u) \times CT(u) \\ + POV(u) \times CT(u) \end{aligned} \tag{14.8}$$

This model naturally handles uneven spacing of repeated measurements within each time series, as well as their correlation which was modeled using a spherical variance-covariance structure. Once the trend model was fitted, regression residuals were interpolated using space-time simple kriging and the ST covariance models introduced in Sect. 14.2.2.
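The trend-plus-residual decomposition can be illustrated with a deliberately simplified sketch. It is not the model of Eq. 14.8: only an intercept, LSL, TIME and the LSL × TIME interaction are kept, ordinary least squares replaces the correlated-error fit described above, and the data are made up. The function name `fit_trend` is hypothetical.

```python
import numpy as np

def fit_trend(lsl, time, y):
    """Fit y = b0 + b1*LSL + b2*TIME + b3*LSL*TIME by ordinary least squares
    and return (coefficients, fitted trend, residuals)."""
    lsl = np.asarray(lsl, dtype=float)
    time = np.asarray(time, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(time), lsl, time, lsl * time])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    trend = design @ beta
    return beta, trend, y - trend

# Hypothetical observations: three sampling days for sites without and with LSL.
lsl = [0, 0, 0, 1, 1, 1]
time = [0, 70, 140, 0, 70, 140]                   # days since first sample
y = [0.50, 0.42, 0.40, 0.95, 0.70, 0.52]          # made-up log10(z + 1) values

beta, trend, resid = fit_trend(lsl, time, y)
```

The residuals returned here play the role of $R(\mathbf{u}; t)$ in Eq. 14.7: they would be interpolated by space-time simple kriging and added back to the trend prediction.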

### *14.2.4 Cross-Validation*

The accuracy of the predictive models created by the different approaches (e.g., three types of ST covariance models, univariate vs incorporation of secondary information) was assessed by cross-validation whereby each observation or time series (i.e., all data collected at the same site) was removed at a time and re-estimated using data collected at neighboring sentinel sites. The following performance criteria were then computed from *n* kriging estimates:

• the mean error (ME) of prediction as:

$$ME = \frac{1}{n} \sum_{t=1}^{T} \sum_{\alpha=1}^{n(t)} \left( z^*(\mathbf{u}_\alpha; t) - z(\mathbf{u}_\alpha; t) \right) \tag{14.9}$$

• the mean absolute error (MAE) of prediction as:

$$MAE = \frac{1}{n} \sum_{t=1}^{T} \sum_{\alpha=1}^{n(t)} \left| z^*(\mathbf{u}_\alpha; t) - z(\mathbf{u}_\alpha; t) \right| \tag{14.10}$$

• the mean square standardized residual (MSSR) as:

$$MSSR = \frac{1}{n} \sum_{t=1}^{T} \sum_{\alpha=1}^{n(t)} \frac{\left(z^*(\mathbf{u}_\alpha; t) - z(\mathbf{u}_\alpha; t)\right)^2}{\sigma_K^2(\mathbf{u}_\alpha; t)} \tag{14.11}$$

where $\sigma_K^2(\mathbf{u}_\alpha; t)$ is the kriging variance.

A mean error close to zero indicates a lack of bias, while the mean absolute error should be as small as possible. If the actual estimation error is equal, on average, to the error predicted by the model, the MSSR statistic should be about one (Wackernagel 1998, p. 91).
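The three criteria are straightforward to compute once the cross-validation estimates, the observations and the kriging variances are stored as flat arrays. The sketch below uses a hypothetical function name, `cv_scores`, and made-up numbers.

```python
import numpy as np

def cv_scores(predicted, observed, kriging_variance):
    """Return ME, MAE and MSSR (Eqs. 14.9-14.11) over all n estimates."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    kriging_variance = np.asarray(kriging_variance, dtype=float)
    error = predicted - observed
    return {
        "ME": float(error.mean()),
        "MAE": float(np.abs(error).mean()),
        "MSSR": float(np.mean(error ** 2 / kriging_variance)),
    }

# Made-up cross-validation output for three observations.
scores = cv_scores(predicted=[1.0, 2.0, 3.0],
                   observed=[1.5, 1.5, 3.0],
                   kriging_variance=[0.25, 0.25, 1.0])
```

In this toy case the errors are -0.5, 0.5 and 0, so ME is 0, MAE is 1/3 and MSSR is 2/3. An MSSR below one means that, for these numbers, the kriging variance overstates the actual squared error on average.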

One application of the predictive models is to prioritize any further sampling or intervention by ranking tax parcels from highly hazardous to less hazardous on the basis of kriging estimates. The ability of this ranking to identify successfully sites where WLL is greater than or equal to the EPA action level of 15 μg/L was assessed using Receiver Operating Characteristics (ROC) curves which plot the probability of false positive versus the probability of detection (Swets 1988; Fawcett 2006; Goovaerts et al. 2016). The accuracy of the classification was quantified using the relative area under the ROC curve (AUC statistic), which ranges from 0 (worst case) to 1 (best case). The AUC is equivalent to the probability that the classifier will rank a randomly chosen positive instance (e.g., $z_c \ge 15$ μg/L) higher than a randomly chosen negative instance (e.g., $z_c < 15$ μg/L).
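The probabilistic interpretation of the AUC gives a direct way to compute it by comparing every positive/negative pair. The sketch below follows that definition rather than integrating the ROC curve; the function name `auc_from_ranking`, the convention of counting ties as one half, and the example values are assumptions made for illustration.

```python
import numpy as np

def auc_from_ranking(scores, observed_wll, threshold=15.0):
    """Probability that a randomly chosen site with observed WLL >= threshold
    receives a higher score than a randomly chosen site below the threshold.
    Ties between scores count as one half."""
    scores = np.asarray(scores, dtype=float)
    positive = np.asarray(observed_wll, dtype=float) >= threshold
    pos, neg = scores[positive], scores[~positive]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("both classes must be present")
    diff = pos[:, None] - neg[None, :]          # all positive/negative pairs
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

# Made-up observed WLL (ug/L) and kriging estimates used as ranking scores.
observed = [2.0, 20.0, 0.0, 40.0, 16.0]
estimates = [1.0, 3.0, 2.5, 4.0, 2.0]
auc = auc_from_ranking(estimates, observed)
```

Here three sites exceed the action level and two do not, giving six pairs; five of them are ranked correctly, so the AUC is 5/6.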

### **14.3 Results and Discussion**

### *14.3.1 Spatial Distribution*

Figure 14.1a shows the location of all 819 sentinel sites within the nine wards in the city of Flint. Site-specific statistics such as number of observations and average log concentrations recorded for each time series, as well as composition of service line (GSL vs. LSL), were aggregated at the census tract level for better visualization. Geographical clusters of sentinel sites can be distinguished in several census tracts (e.g. border of wards 2 and 6, wards 7 and 9) which tend to be the tracts with the largest WLLs (Fig. 14.1c) and percentages of sampled LSLs (Fig. 14.1d). There is also a clear spatial trend with fewer lead service lines (e.g., none in Ward 1) and shorter time series (Fig. 14.1b) sampled in the Northern part of the city. Ward 5 includes the oldest neighborhood where GSLs are prevalent (Fig. 14.1e), while LSLs appear as small clusters, in particular in wards 6, 7 and 9 (Goovaerts 2017c).

### *14.3.2 Temporal Trend Modeling*

Temporal trends for the three major types of service line were visualized by aggregating observations within non-overlapping 14-day windows, which corresponds to the average time interval between sampling rounds during the first phase (Round S) of the sentinel monitoring program (Table 14.1). Except for LSLs, water lead levels do not appear to have declined over the 40-week sampling period; actually they seem to have slightly increased for GSLs (Fig. 14.2a). These results are however a direct artifact of the sampling strategy whereby 80% of sentinel sites were not sampled beyond week 16, while sampling continued at sites where the risk of exceeding the EPA action level of 15 μg/L was the greatest (Table 14.2).

**Fig. 14.1 a** Location of sentinel sites in each of the nine wards, and several census tract-level statistics: **b** percentages of time series (TS) including more than five observations, **c** average water lead levels, **d** percentage of sites with lead service lines, **e** percentage of sites with galvanized service lines. Shaded polygons indicate census tracts that do not include any sentinel site (missing values)
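The aggregation into non-overlapping 14-day windows used for the averaged time series can be sketched as follows. The function name `aggregate_by_window` and the example values are hypothetical; in practice the function would be applied separately to the observations of each service line type.

```python
import numpy as np

def aggregate_by_window(days, values, window=14):
    """Average values within non-overlapping windows of `window` days.
    Returns a dict mapping the first day of each window to the window mean."""
    days = np.asarray(days, dtype=float)
    values = np.asarray(values, dtype=float)
    bins = (days // window).astype(int)          # window index of each sample
    return {int(b) * window: float(values[bins == b].mean())
            for b in np.unique(bins)}

# Made-up sampling days (since the first round) and log10(z + 1) values.
days = [0, 3, 13, 14, 20, 29]
wll = [1.0, 2.0, 3.0, 4.0, 6.0, 5.0]
averaged = aggregate_by_window(days, wll)
```

Because every window mean is computed from whichever sites happen to be sampled in that window, a change in the set of sampled sites over time shifts the averages even if no individual site changes, which is the sampling artifact discussed in this section.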

After elimination of all sites where fewer than six observations were collected, the averaged time series display the expected decline (Fig. 14.2b). The impact is minimal for LSLs since most of these sites are considered at risk and were sampled during both the initial and extended sentinel sampling programs (Rounds S and X).

**Fig. 14.2** Time series of observed (solid line) and predicted by regression (dashed line) water lead levels computed on average for the three major types of service line: lead, galvanized, and copper. Results (log transformed concentrations) are calculated from: **a** all sites, and **b** subset of sites where at least six observations were recorded

The selection bias is stronger for copper and galvanized lines, which explains the larger water lead levels recorded during the first 16 weeks relative to LSLs.

This sampling bias greatly complicated the modeling of temporal trends by regression. Indeed, using all the data would underestimate the weekly rate of decline of water lead levels, whereas subsetting the dataset (e.g., using only time series including more than five data points as in Fig. 14.2b) would result in overestimating the concentrations at a majority of sites. In addition, the time series length cannot be used as a covariate if the model is to be applied at unmonitored locations. Two modeling strategies were considered in this chapter. First, because of its relationship with time series length (Fig. 14.1), census tract was used as a covariate in the regression model (Eq. 14.8). The second, more complicated approach was to allow the intercept to fluctuate among sentinel sites, even when located within the same tract; i.e., use a mixed model where the intercept is modeled as a random effect. The trade-off for this added flexibility was the need to estimate the intercept at unmonitored locations, which was accomplished using ordinary kriging. Despite providing a better fit than the first alternative, the mixed model did not lead to more accurate kriging estimates, hence only the first option is discussed hereafter.

All six interaction terms in the trend model (Eq. 14.8) were highly significant (α = 0.01). The correlation between predicted and observed WLL is however rather weak (r = 0.47), which illustrates the challenge of predicting spatial and temporal variations in lead for drinking water (Bailey and Russell 1981; Del Toral et al. 2013). While the output of the regression model provides a reasonable fit to the SL-specific time series computed using all the data (Fig. 14.2a), it underestimates water lead levels for LSL and GSL when using only time series including more than five data points (Fig. 14.2b).

### *14.3.3 Variography*

Semivariograms helped quantify the scale and magnitude of the space-time variability displayed by the maps and time series of Figs. 14.1 and 14.2. The spatial semivariogram (Fig. 14.3a) shows three nested scales of spatial variability: (1) a long range (2.35 km) caused by the neighborhood effect since houses in the same neighborhood tend to be built at the same time (i.e., similar plumbing system) and have similar water age, (2) a short range (200 m) corresponding to variability between adjacent houses, and (3) a nugget effect or discontinuity at the origin which represents the variability among samples taken within the same tax parcel (i.e. different apartments and/or measurement error for samples taken within the same residence). The substantial short-range variability (71% of total sill) likely reflects the heterogeneity in housing conditions (e.g., renovated houses) as well as the lack of uniformity of sampling conducted by homeowners since even with simple instructions it is difficult to ensure strict adherence to any sampling protocol (Del Toral et al. 2013). This interpretation is confirmed by the similar short-range variability displayed by the semivariogram of regression residuals (Fig. 14.3a, lower blue curve) since the regression model (Eq. 14.8) does not account for sampling characteristics. It is noteworthy that the longer range of 2.35 km is still fairly small relative to the size of the city (see legend of Fig. 14.1a), while the average separation distance between each sentinel site and the closest neighbor (293 m) exceeds the shortest range (200 m) that encapsulates 71% of the total spatial variability.

The temporal semivariogram (Fig. 14.3b) also displays three nested scales of variability, although the longer-range structure (110 days) here represents 53% of the total variability. Another difference with the spatial case is the overlap of temporal semivariograms for WLLs and regression residuals, illustrating the inability of the trend model (Eq. 14.8) to capture purely temporal changes. This result is in agreement with the small magnitude of changes displayed by the time series of predicted values in Fig. 14.2 (dashed line). Comparison of the total sills of spatial and temporal semivariograms (Fig. 14.3a–b) indicates that the variability observed across space is greater than the temporal variability. Such zonal anisotropy is in conflict with the assumption underlying the metric ST covariance model (Eq. 14.4).

**Fig. 14.3** Experimental semivariograms with the fitted models that were used to form the three types of ST covariance models (Eqs. 14.3–14.5): **a** spatial semivariogram (lower curve is for residuals), **b** temporal semivariogram, **c** metric semivariogram for WLLs, **d** metric semivariogram for regression residuals, **e** metric residual semivariogram (sum-metric model) for WLLs, **f** metric residual semivariogram for regression residuals

Figure 14.3c–d show the semivariograms computed using a normalized space-time distance (metric model). Because the spatial and temporal lags were rescaled using different constants for the WLL and residual semivariograms, these two curves are plotted separately. The vertical axis is however comparable and illustrates the smaller variability of residuals (i.e., lower sill for the semivariogram of Fig. 14.3d). Once again, both semivariograms display substantial short-range variability. The last two semivariograms (Fig. 14.3e–f) represent the metric space-time model that captures the residual variability in the sum-metric model (Eq. 14.5).

### *14.3.4 Cross-Validation Analysis*

The semivariogram models of Fig. 14.3 were used to conduct a cross-validation analysis whereby one observation (LOO approach) or one time series (LTO approach) was removed at a time and re-estimated using data collected at neighboring sentinel sites. Based on a sensitivity analysis using ST ordinary kriging and MAE criterion, 48 observations with a maximum of three data points per site were retained for the estimation by univariate and residual ST kriging. Results obtained for predictions by the time trend model were also included as reference in Table 14.3.
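The difference between the two cross-validation schemes can be sketched in a few lines of Python. This is only an illustration of how the data are partitioned, not the author's implementation: the record layout, the site labels, the values, and the function names `loo_splits` and `lto_splits` are all hypothetical, and the kriging step itself is omitted.

```python
from collections import defaultdict

# Hypothetical records: site id, sampling week, water lead level (made-up values)
records = [
    {"site": "A", "week": 1, "wll": 3.0},
    {"site": "A", "week": 5, "wll": 2.0},
    {"site": "B", "week": 1, "wll": 12.0},
    {"site": "B", "week": 9, "wll": 7.0},
    {"site": "C", "week": 3, "wll": 0.5},
]


def loo_splits(data):
    """Leave one observation out: the other samples of the same site remain available."""
    for i, held_out in enumerate(data):
        yield held_out, data[:i] + data[i + 1:]


def lto_splits(data):
    """Leave one time series out: every sample of one site is withheld together."""
    by_site = defaultdict(list)
    for rec in data:
        by_site[rec["site"]].append(rec)
    for site, held_out in by_site.items():
        yield held_out, [rec for rec in data if rec["site"] != site]


for held_out, training in lto_splits(records):
    print(held_out[0]["site"], len(held_out), "withheld;", len(training), "used for prediction")
```

Under LOO the training set still contains colocated observations from other dates, which is why purely temporal correlation can contribute to the prediction; under LTO it cannot.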

The first three rows in Table 14.3 indicate that all algorithms give unbiased predictions (ME close to zero). As expected, the best prediction scores (i.e., lower MAE and higher AUC) are obtained when using data from the same time series (LOO approach) instead of relying solely on non-colocated data (LTO approach). Except for MSSR the product-sum model performs best, with the sum-metric model being a close second. The metric model underperforms the other two models because the combination of both spatial and temporal dimensions through a normalized space-time distance leads one to underestimate the correlation among observations of the same time series. In other words, the assumption underlying the metric model is incompatible with the zonal anisotropy detected on Fig. 14.3. Accounting for secondary information through residual kriging slightly improves the prediction relative to ST ordinary kriging; both kriging algorithms outperformed the trend model.

**Table 14.3** Results of cross-validation analysis conducted by leaving one observation out (LOO) or one time series out (LTO) at a time. The four performance criteria described in Sect. 14.2.4 were computed for three types of space-time covariance models (generalized product-sum, metric, and sum-metric) and three space-time interpolation algorithms (ST ordinary kriging, trend model fitted by linear regression with and without interpolation by ST residual kriging)

<sup>a</sup> Value for the trend model is the same for all six combinations

These results however apply only to the narrow situation where exposure to lead in drinking water is reconstructed at the sole sentinel sites. For prediction at sites where no data was collected, LTO results indicate that differences between ST covariance models are much smaller as purely temporal correlations are not used in the kriging system. Nevertheless, the product-sum model still performs best. The LTO approach also emphasizes the benefit of using trend models that account for secondary information (i.e., larger differences between residual kriging and ordinary kriging). Yet, prediction performances actually deteriorate when kriged residuals are added to the trend model: the sole trend model gives better prediction than residual kriging. It is however noteworthy that the trend model was not cross-validated, hence the observation being predicted was used to create the model.

**Fig. 14.4** Impact of the size of kriging search window on several statistics computed by the leave one time series out (LTO) approach: **a** mean absolute error of prediction, and **b** area under the ROC curve. Horizontal dashed lines represent the values obtained for the time trend model created by linear regression. **c** percentages of search windows that include at least one observation when centered on sampled sentinel sites or tax parcels

Because of the substantial short-scale spatial variability, retaining increasingly distant data is expected to add more and more noise to the kriging estimate. This was investigated by changing the search strategy and selecting only sentinel sites located within a given distance of the site being predicted. If no data was located within the search radius, the kriged residual was zero and the residual kriging estimate was simply the value of the trend model. Figure 14.4 shows results of this sensitivity analysis conducted for the product-sum model over distances ranging from 50 m to 1 km. For the mean absolute error of prediction, the small benefit of residual kriging vanishes as soon as data beyond 100 m are used in the estimation (Fig. 14.4a), while this distance is 200 m for the area under the ROC curve (Fig. 14.4b). Figure 14.4c indicates that 42% of sentinel sites have another sentinel site within 100 m, while this percentage is only 4.6% for tax parcels. In other words, there is little benefit in applying geostatistics to model the space-time distribution of WLL over the 56,039 tax parcels in Flint using the data collected at sentinel sites.

### **14.4 Conclusions**

This chapter presented the first application of space-time geostatistics to lead levels recorded in drinking water of a public distribution system. The methodology was illustrated using 4,120 water samples that were collected at 819 "sentinel" sites over a 40-week period in the city of Flint. Despite a sizable database assembled by the State of Michigan, the geostatistical analysis was hampered by a temporal sampling bias and the existence of substantial variability over a few hundred meters. Unlike other countries such as Canada or France, sampling is not conducted by a trained technician in the US. Instead, homeowners are expected to collect water samples after a minimum of 6 h of stagnation (e.g., overnight stagnation) following specific instructions (US EPA 2016), which can cause substantial variability among households. Other sources of fluctuation include heterogeneity in the plumbing system (e.g., renovation, installation of a new meter), location of sampled faucets (e.g., bathroom vs. kitchen), or water temperature (e.g., lead solubility increases with water temperature), to name a few.

In the present case-study, space-time kriging proved beneficial only in the situation where observations had been collected at the site being predicted; i.e., to fill the gaps in time series. The generalized product-sum and sum-metric space-time covariance models then outperformed the metric model that ignores the greater variation across space relative to time (zonal anisotropy). Sentinel sites represent however only 1.5% of tax parcels in the city of Flint. At unsampled sites the kriging prediction was no better than the temporal trend estimated by linear regression and it turned out to become less accurate if no data was collected within 100 meters. Although the regression model included site-specific characteristics, such as construction year and composition of service lines, it was unable to explain the short-range variability, leaving 78% of the total variance unaccounted for (R<sup>2</sup> = 22%).

In the future, several approaches will be investigated to tackle the impact of short-range variability on prediction. First, the data analyzed in this chapter represent less than 20% of the water samples available for the city of Flint. The majority of samples were collected by voluntary sampling whereby concerned citizens received a testing kit and conducted sampling on their own (Goovaerts 2017a, b). Despite the lack of periodic sampling in time and the existence of temporal bias (e.g., houses with low lead levels were less likely to be tested again), the greater spatial coverage (i.e., more than 18% of tax parcels sampled) will substantially reduce the average distance between a tax parcel and the closest observation. However, spatial heterogeneity will likely still be present over short distances, leading one to question our ability to make predictions at the tax parcel level. More appropriate spatial supports for prediction could be census block groups, which are statistical divisions of census tracts and are generally defined to contain between 600 and 3,000 people. The city of Flint includes 132 block groups and 40 census tracts. Such spatial aggregation or upscaling would be a way to filter between-household fluctuations, which appear to be mainly noise. As more US cities are facing similar drinking water crises, reliable techniques for sampling and modeling spatial and temporal changes in water lead levels will be sorely needed.

**Acknowledgements** This research was funded by grant R44 ES022113-02 from the National Institute of Environmental Health Sciences. The views stated in this publication are those of the author and do not necessarily represent the official views of the NIEHS.

### **References**


files/2015-09/documents/2007\_05\_18\_disinfection\_tcr\_whitepaper\_tcr\_waterdistribution.pdf. Accessed 26 May 2017

US Environmental Protection Agency, Office of Ground Water & Drinking Water (2016) Memorandum: clarification of recommended tap sampling procedures for purposes of the lead and copper rule. https://www.epa.gov/sites/production/files/2016-02/documents/epa\_lcr\_sampling\_memorandum\_dated\_february\_29\_2016\_508.pdf. Accessed 26 May 2017

Wackernagel H (1998) Multivariate geostatistics, 2nd completely revised edition. Springer, Berlin

Wang Z, Devine H, Zhang W et al (2014) Using a GIS and GIS-assisted water quality model to analyze the deterministic factors for lead and copper corrosion in drinking water distribution systems. J Environ Eng 140:A4014004


### **Chapter 15 Statistical Parametric Mapping for Geoscience Applications**

**Sean A. McKenna**

**Abstract** Spatial fields are a common representation of continuous geoscience and environmental variables. Examples include permeability, porosity, mineral content, contaminant levels, seismic impedance, elevation, and reflectance/absorption in satellite imagery. Identifying differences between spatial fields is often of interest as those differences may represent key indicators of change. Defining a significant difference is often problem specific, but generally includes some measure of both the magnitude and the spatial extent of the difference. This chapter demonstrates a set of techniques available for the detection of anomalies in difference maps represented as multivariate spatial fields. The multiGaussian model is used as a model of spatially distributed error and several techniques based on the Euler characteristic are employed to define the significance of the number and size of excursion sets in the truncated multiGaussian field. This review draws heavily on developments made in the field of functional magnetic resonance imaging (fMRI) and applies them to several examples motivated by environmental and geoscience problems.

### **15.1 Introduction**

A general problem in geological and environmental investigations is rapid and accurate identification of anomalous measurements from one-, two- or three-dimensional data. Example applications include cluster identification in spatial point processes (e.g., Byers and Raftery 1998; Cressie and Collins 2001), detection of anomalies in remotely sensed imagery (e.g., Stein et al. 2002), and identification of anomalous clusters in lattice data (e.g., Goovaerts 2009).

S. A. McKenna (✉)
IBM Research, Dublin, Ireland
e-mail: seanmcke@ie.ibm.com

© The Author(s) 2018
B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_15

The problem of anomaly detection is complicated when the data set is composed of more than a handful of variables (multi-variate) and becomes even more complex when the multiple variables comprise a random field exhibiting spatial correlation.

The temporal and/or spatial correlation of the data rules out the application of standard statistical tests for change detection and has also limited the development of hypothesis testing techniques for correlated data (Gilbert 1987). For applications with correlated data, simulation techniques can often be used to develop the null distribution, but development of closed form hypothesis tests for analysis of the spatial random fields associated with geostatistics has remained sparse.

One approach to detection of anomalies in spatially correlated data is the use of Local Indicators of Spatial Association (LISA) statistics (Anselin 1995; Goovaerts et al. 2005; Goovaerts 2009). These tests focus on the local relationships between adjacent cells and explore combinations of cells defined with an adjacency matrix and/or a moving window visiting all cells in a lattice. A very different approach is to model the difference between images as a continuous random field and use properties of an underlying random field model to identify anomalies.

Change detection in spatial-temporal data sets has received considerable attention over the past 15–20 years within the medical imaging research community (Brett et al. 2003; Friston et al. 1994, 1995; Worsley et al. 1992, 1996) and a significant development of this research has been Statistical Parametric Mapping (SPM).

The practice of statistical parametric mapping has been developed in the field of medical imaging, particularly in brain imaging, and in the practice of functional magnetic resonance imaging (fMRI) of the brain while the subject is performing various tasks (functions). Friston et al. (1995, p. 190) provide a concise definition of SPM: "*one proceeds by analyzing each voxel using any (univariate) statistical parametric test. The resulting statistics are assembled into an image, that is then interpreted as a spatially extended statistical process*". In other words, at each pixel (voxel) in an image, a univariate statistical test (e.g., *t*-test) is applied and the resulting values of the test statistic at each pixel are then displayed as a map. The underlying spatial correlation of the map is used in creating a multivariate statistical model that describes that map and this model can be used for inference. Typically, the resulting map is analyzed using theory that underlies stationary Gaussian fields and techniques developed for excursion sets of these fields. Properties of truncated Gaussian fields (e.g., Adler and Hasofer 1976; Adler 1981; Adler and Taylor 2007; Adler et al. 2009) serve as the basis of the SPM techniques.

To date, the SPM approach has not been applied outside of medical imaging, but it appears to be a technique that could be successfully applied in a number of areas of interest in the earth and environmental sciences. The goals of this work are to both describe the basis of SPM and then apply SPM to example problems.

### **15.2 Anomaly Detection with Statistical Parametric Mapping**

Anomaly detection is defined here as the identification of a region in time and/or space that is anomalous in its shape, size (duration) and/or values within the region (intensity). Two modes of anomaly detection in spatial-temporal data sets can be defined: (1) anomaly detection in an online mode, where prior data are used to predict future values of the measured variable and anomalies occur in areas and/or times where the predictions are inconsistent with the corresponding measurements; (2) anomaly detection as the difference between two classes of data, where a difference in some treatment or external forcing condition is suspected to cause a difference in the measured variable. The anomalies in this case are significant differences in measured variables observed with and without activation of the external condition. This latter case is the focus of the work in this chapter.

Specifically, an ensemble of geologic models can be created in 1, 2 or 3 dimensions where each member of the ensemble is associated with a specific "treatment" or "result" that can be used to group ensemble members into separate classes. As examples:


Two measures of anomaly detection can be employed: omnibus and localized (Worsley et al. 1992). Omnibus detection uses a set of calculations to determine if the current curve, map or volume, taken as a whole, is anomalous. Localized detection determines the specific location(s) within the study domain where the anomaly occurs and is the focus of this work.

Anomaly detection is not done directly on the generated or observed ensemble members, but on a difference between groups of members as defined through the treatment or result. Here, the differences are calculated as the differences of two average values. The averages are calculated at each point, pixel or voxel within the domain using standard univariate statistical tests (e.g., *t*-test). Each pixel-wise average is calculated over a set of ensemble members created under a specific condition (treatment) or generating a specific result. For example, in studies of the human brain, images are often collected under "resting" and "stimulated" conditions and the average image from each condition is then used to create a difference map.

SPM was developed to directly address the problem of spatial correlation in statistical testing. Direct application of most statistical tests requires independence of the observations, but for many problems, including those studied here, correlation between adjacent observations is the norm. Therefore, the results of the statistical tests for adjacent, or even nearby, pixels cannot be effectively evaluated using standard techniques. SPM considers a single map comprised of the results of all local (pixel-wise) statistical tests and provides several measures for comparison of the values in the map to critical threshold levels.

### *15.2.1 MultiGaussian Fields*

The basis of the SPM approach is the analysis of the number, size and degree of excursions from a multiGaussian (mG) random field. For a concise, statistical description of mG fields, see Adler et al. (2009, p. 27). Stationary multiGaussian fields are fully defined by a mean and covariance matrix. In a practical sense, values at each pixel are defined with a Gaussian distribution. The correlation between those multiple distributions is defined by the covariance. Spatial correlation can be added to an uncorrelated field through the convolution of a smoothing kernel with an uncorrelated (white noise) field. As an example, the 2D Gaussian kernel is defined as

$$G(x, y) = \frac{1}{2\pi |\boldsymbol{\Sigma}|^{1/2}} \exp\left(-\frac{1}{2} \mathbf{d}\boldsymbol{\Sigma}^{-1} \mathbf{d}^{\mathrm{T}}\right),$$

where **d** is the distance vector containing distances *dx* and *dy* from any location (*x, y*) to the origin of the Gaussian function (*x<sub>0</sub>, y<sub>0</sub>*) (here (0, 0) for the standard normal distribution). In this work, the covariance matrix, *Σ* = *σ<sup>2</sup> I* (*I* = identity matrix), is diagonal for the specific case of the kernel being aligned with the grid axes.

An often-used measure of the spatial bandwidth of a smoothing kernel in the image processing literature is the "full width at half maximum" (FWHM). For the Gaussian kernel above, the FWHM is:

$$FWHM = \sigma \sqrt{8 \ln(2)}$$
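As a quick numerical check of the kernel and FWHM expressions above, the following sketch evaluates the isotropic kernel on an integer grid. It assumes *Σ* = *σ<sup>2</sup> I* in 2D, so that |*Σ*|<sup>1/2</sup> = *σ<sup>2</sup>*; the value of `sigma`, the truncation at four standard deviations, and the function names are illustrative choices, not taken from the chapter.

```python
import math


def kernel_value(dx, dy, sigma):
    """G(x, y) for Sigma = sigma^2 I, so that |Sigma|^(1/2) = sigma^2 in 2D."""
    return math.exp(-0.5 * (dx * dx + dy * dy) / sigma ** 2) / (2.0 * math.pi * sigma ** 2)


def fwhm_from_sigma(sigma):
    """Full width at half maximum of a Gaussian with standard deviation sigma."""
    return sigma * math.sqrt(8.0 * math.log(2.0))


sigma = 3.0        # hypothetical kernel standard deviation, in pixels
half_width = 12    # sampled kernel truncated at 4 sigma (an assumption)
offsets = range(-half_width, half_width + 1)
kernel = [[kernel_value(dx, dy, sigma) for dx in offsets] for dy in offsets]

kernel_sum = sum(sum(row) for row in kernel)      # close to 1 for a normalized kernel
fwhm = fwhm_from_sigma(sigma)
peak = kernel_value(0.0, 0.0, sigma)
half = kernel_value(fwhm / 2.0, 0.0, sigma)       # value at half the FWHM from the center

print("kernel sum:", kernel_sum)
print("FWHM:", fwhm, "value at FWHM/2 relative to peak:", half / peak)
```

The kernel value at a distance of FWHM/2 from the center is half the peak value, which is the defining property of the FWHM.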

If the mG field is not created, but is obtained from some type of imagery or other analyses, then there is no known underlying kernel and it is necessary to estimate the FWHM directly from the image. Estimation can be done using the covariance matrix of the partial derivatives of the image values, *T*, with respect to the discretization of the image. In 2D, the covariance matrix is:

$$
\Lambda = \begin{bmatrix}
Var\left(\frac{\partial T}{\partial \mathbf{x}}\right) & Cov\left(\frac{\partial T}{\partial \mathbf{x}}, \frac{\partial T}{\partial \mathbf{y}}\right) \\
Cov\left(\frac{\partial T}{\partial \mathbf{x}}, \frac{\partial T}{\partial \mathbf{y}}\right) & Var\left(\frac{\partial T}{\partial \mathbf{y}}\right)
\end{bmatrix}.
$$

This covariance matrix can be interpreted as a measure of the roughness/smoothness of the image.

Estimation of Λ can be achieved through several approaches and here the simple relationship defined by Worsley et al. (1992) between the FWHM values in each of the principal directions and Λ is utilized. The derivatives in the covariance matrix of an image can be approximated numerically in each spatial dimension by differences between adjacent pixels, calculated as:

$$\begin{aligned} Z\_{xi}(x, y) &= \left\{ T\_i(x + \delta\_{x}, y) - T\_i(x, y) \right\} / \delta\_{x}, \\ Z\_{yi}(x, y) &= \left\{ T\_i(x, y + \delta\_{y}) - T\_i(x, y) \right\} / \delta\_{y} \end{aligned}$$

where δ<sub>x</sub> and δ<sub>y</sub> are the dimensions of the image pixels in the x and y directions. The variances and covariances of the differences are then used to approximate the variances and covariances of the derivatives:

$$\begin{aligned} V\_{xx} &= \sum\_{i,x,y} Z\_{xi}(x,y)^2 / N(n-1) \\ V\_{yy} &= \sum\_{i,x,y} Z\_{yi}(x,y)^2 / N(n-1) \\ V\_{xy} &= \sum\_{i,x,y} \left\{ Z\_{xi}(x,y) + Z\_{xi}(x,y + \delta\_{y}) \right\} \left\{ Z\_{yi}(x,y) + Z\_{yi}(x + \delta\_{x},y) \right\} / 4N(n-1) \end{aligned}$$

These variance and covariance estimates are used to estimate Λ:

$$
\Lambda = \begin{bmatrix} V\_{xx} & V\_{xy} \\ V\_{xy} & V\_{yy} \end{bmatrix}
$$

Finally, the FWHM in the X and Y directions are calculated as:

$$FWHM\_x = \sqrt{\frac{4\ln(2)}{V\_{xx}}}$$

$$FWHM\_y = \sqrt{\frac{4\ln(2)}{V\_{yy}}}$$
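The estimation procedure can be illustrated on a synthetic field with a known smoothing kernel. The sketch below is a simplified, standard-library-only illustration under several assumptions: a single image standardized to unit variance, unit pixel size, periodic (wrap-around) smoothing, and a plain average of squared differences in place of the *N*(*n* − 1) normalization. The names `smooth_periodic` and `estimate_fwhm` are hypothetical.

```python
import math
import random


def smooth_periodic(field, sigma):
    """Separable Gaussian smoothing of a square grid with wrap-around edges."""
    n = len(field)
    r = int(math.ceil(4 * sigma))                       # truncate the kernel at 4 sigma
    w = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-r, r + 1)]
    total = sum(w)
    w = [v / total for v in w]
    rows = [[sum(w[k + r] * row[(j + k) % n] for k in range(-r, r + 1))
             for j in range(n)] for row in field]
    return [[sum(w[k + r] * rows[(i + k) % n][j] for k in range(-r, r + 1))
             for j in range(n)] for i in range(n)]


def estimate_fwhm(image, dx=1.0, dy=1.0):
    """FWHM in x (columns) and y (rows) from variances of adjacent-pixel differences."""
    n_rows, n_cols = len(image), len(image[0])
    values = [v for row in image for v in row]
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    t = [[(v - mean) / sd for v in row] for row in image]        # standardize the image
    zx = [(t[i][j + 1] - t[i][j]) / dx for i in range(n_rows) for j in range(n_cols - 1)]
    zy = [(t[i + 1][j] - t[i][j]) / dy for i in range(n_rows - 1) for j in range(n_cols)]
    v_xx = sum(z * z for z in zx) / len(zx)
    v_yy = sum(z * z for z in zy) / len(zy)
    c = 4.0 * math.log(2.0)
    return math.sqrt(c / v_xx), math.sqrt(c / v_yy)


random.seed(0)
n, sigma = 128, 3.0                                     # hypothetical grid size and kernel width
noise = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
field = smooth_periodic(noise, sigma)
fwhm_x, fwhm_y = estimate_fwhm(field)
fwhm_kernel = sigma * math.sqrt(8.0 * math.log(2.0))
print("estimated:", fwhm_x, fwhm_y, "kernel:", fwhm_kernel)
```

The estimates should fall near the FWHM of the generating kernel, with some scatter from the finite grid and from replacing derivatives with unit-lag differences.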

### *15.2.2 Calculating the SPM*

The Statistical Parametric Map is the difference image between individual pairs of images or average images, which is typically transformed from a map of t-statistics to a map of Gaussian *Z*-score values. The different methods used in this study for calculating the SPM are described in this section.

#### **15.2.2.1 Conditional Differences**

The *t*-test and *t*-statistic are used exclusively in this chapter for the conditional differences between two ensembles and a review of the *t*-statistic is provided in the Appendix. It is noted that other statistical tests and their resulting test-statistics, e.g., χ, *Z*, *f*, as well as measures of correlation can also be used as the basis of an SPM. For the *t*-tests employed here, a location (pixel)-specific calculation of the standard deviation is used. Another approach is to calculate the pooled standard deviation across the image (image-based) and arguments for using the image-based standard deviation are given by Worsley et al. (1992). In typical applications, the number of observations under each condition is small, near a dozen, and therefore the effective degrees of freedom for *T*(*x*, *y*) is generally small and needs to be used in the transformation of the *t*-field to a standard normal Gaussian *Z*-field.

The cumulative probability of a *t*-statistic is found from the *t*-distribution function with the appropriate degrees of freedom. This probability is then used with the inverse of the Gaussian distribution function to get the z-score value:

$$\begin{aligned} P(Y \le y) &= T(y; \nu) \\ z &= G^{-1}(P(Y \le y)) \end{aligned}$$

The resulting fields are now multiGaussian SPMs and the anomaly detection algorithms developed for SPM analysis can be applied.
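The *t*-to-*Z* transformation can be sketched as follows. To stay within the Python standard library, the *t*-distribution function is obtained here by Simpson integration of the density, which is an illustrative shortcut and adequate only for moderate |*t*| and degrees of freedom; a statistical library routine would normally be preferred. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def t_pdf(x, nu):
    """Density of Student's t-distribution with nu degrees of freedom."""
    c = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return c * (1 + x * x / nu) ** (-(nu + 1) / 2)


def t_cdf(x, nu, steps=2000):
    """P(Y <= x) by Simpson's rule on [0, |x|], using the symmetry of the density."""
    a = abs(x)
    h = a / steps                                  # steps must be even
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_to_z(t_value, nu):
    """Map a t-statistic to the z-score with the same cumulative probability."""
    # inv_cdf raises an error if the probability rounds to exactly 0 or 1 (extreme t).
    return NormalDist().inv_cdf(t_cdf(t_value, nu))


nu = 22                                            # e.g., two groups of 12 images (assumed)
for t_value in (-3.0, 0.0, 1.5, 3.0):
    print(t_value, t_to_z(t_value, nu))
```

Because the *t*-distribution has heavier tails than the Gaussian, the resulting |*z*| is smaller than |*t*|, and the difference shrinks as the degrees of freedom increase.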

#### **15.2.2.2 Isolated Regions of Activation**

Anomaly detection here is focused on the number, size and location of regions within an SPM (a curve, image or volume) that exceed a given threshold level, *u*. These regions are known as "regions of activation", "regions of exceedance" or "excursions". The numbers, sizes and locations of these excursions are then compared against a reference model of the expected expression of such regions. Truncation of a Gaussian field at a threshold *u* defines the *u*-level excursion set:

$$X\_{u} = \left\{ x \in R^D \colon Y(x) \ge u \right\}.$$

A large body of literature on the properties of excursion sets (regions of exceedance) in Gaussian random fields is available (e.g., Adler et al. 2009; Friston et al. 1994; Lantuejoul 2002). Friston et al. (1994) characterize three related properties of excursion sets in truncated Gaussian random fields:

- *N*, the number of cells (pixels or voxels) above the threshold;
- *m*, the number of isolated regions above the threshold;
- *n*, the number of cells in each region;

with expectation relationship *E[N]* = *E[m]E[n]*. For a threshold value, *u*, the expected number of cells above that threshold, *E[N]*, is provided by the Gaussian cdf and the size of the domain, *S*:

$$E[N] = S \int\_{u}^{\infty} \left(2\pi\right)^{-1/2} e^{-z^2/2} dz$$

A measure of the number of isolated regions above the threshold can be obtained from the Euler Characteristic, EC. In two dimensions, the EC represents the number of connected excursion sets in the domain minus the total number of holes within those sets. Therefore, EC goes to 0.0 at *u* = 0 and EC becomes negative when *u* < 0.0 as the truncated field represents a single domain-spanning set containing a large number of holes. In 2D, and at relatively high truncation thresholds, EC is equivalent to the number of regions above the threshold, E[m].

$$E[m] = EC = \left(2\pi\right)^{-(D+1)/2} W^{-D} S u^{(D-1)} e^{-u^2/2}$$

where *D* is the dimension of the domain and *W* is an alternative measure of the spatial correlation of the mG field defined as a fraction of the FWHM:

$$W = FWHM/\sqrt{4\ln(2)}$$

For a given threshold, *u*, the average area of the individual regions is found from the expectation relationship:

$$E[n] = E[N]/E[m] = E[N]/EC$$
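The three expectations can be evaluated together for a given domain size, FWHM and threshold. The sketch below uses the expected Euler characteristic in the form given by Friston et al. (1994), *E[m]* = *S* (2π)<sup>−(*D*+1)/2</sup> *W*<sup>−*D*</sup> *u*<sup>*D*−1</sup> exp(−*u*<sup>2</sup>/2); the function name and the example inputs (a 500 × 500 field with a FWHM of 9 pixels truncated at *u* = 2.5) are illustrative choices.

```python
import math
from statistics import NormalDist


def expected_excursions(S, fwhm, u, D=2):
    """Return (E[N], E[m], E[n]) for a stationary Gaussian field truncated at u."""
    W = fwhm / math.sqrt(4.0 * math.log(2.0))
    e_N = S * (1.0 - NormalDist().cdf(u))                  # expected cells above u
    e_m = (S * (2.0 * math.pi) ** (-(D + 1) / 2.0) * W ** (-D)
           * u ** (D - 1) * math.exp(-u * u / 2.0))        # expected EC (regions at high u)
    # The mean region size is only meaningful where the EC approximates a region count.
    e_n = e_N / e_m if e_m > 0 else float("nan")
    return e_N, e_m, e_n


e_N, e_m, e_n = expected_excursions(S=500 * 500, fwhm=9.0, u=2.5)
print("E[N] =", e_N, "E[m] =", e_m, "E[n] =", e_n)
```

For *D* = 2 the expression returns zero at *u* = 0 and negative values for *u* < 0, matching the behavior of the EC described above, so the ratio *E[N]*/*E[m]* should only be read as a mean region size at relatively high thresholds.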

Figure 15.1 compares a direct calculation of EC on a multiGaussian field using the Matlab Image Processing toolbox (Matlab 2009) with estimates made using the Euler characteristic equation above across a range of *u* values increasing from left to right. Deviations between the calculated and estimated number of excursions indicate deviations from the definition of a multiGaussian field. The corresponding binary fields (500 × 500 cells) are also shown for several representative threshold values. Note that typically the extreme ends of the graph, corresponding to *u* values (truncation thresholds) with absolute values of 2.5 or greater, are of interest.

**Fig. 15.1** Observed (calculated) and estimated Euler characteristic for a mG field as a function of the truncation threshold, u. The excursion sets for u > 0 are black regions in the binary fields at the top of the image (after McKenna et al. 2011)
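A direct calculation of the EC on a binary field can also be sketched without an image processing toolbox. The version below is not the Matlab routine used for Fig. 15.1; it is a minimal alternative that treats each pixel above the threshold as a closed unit square and counts vertices minus edges plus faces, which corresponds to the number of 8-connected regions minus the number of holes. The function name and the small test field are hypothetical.

```python
def euler_characteristic(binary):
    """EC of a binary image (8-connected regions minus holes) as V - E + F."""
    rows, cols = len(binary), len(binary[0])

    def on(i, j):
        # True if pixel (i, j) lies inside the image and belongs to the excursion set
        return 0 <= i < rows and 0 <= j < cols and bool(binary[i][j])

    faces = sum(on(i, j) for i in range(rows) for j in range(cols))
    # Vertex (i, j) is the top-left corner of pixel (i, j); it counts if any
    # of the four pixels touching it is in the set.
    vertices = sum(on(i - 1, j - 1) or on(i - 1, j) or on(i, j - 1) or on(i, j)
                   for i in range(rows + 1) for j in range(cols + 1))
    # Horizontal edges separate pixels (i - 1, j) and (i, j); vertical edges
    # separate pixels (i, j - 1) and (i, j).
    h_edges = sum(on(i - 1, j) or on(i, j) for i in range(rows + 1) for j in range(cols))
    v_edges = sum(on(i, j - 1) or on(i, j) for i in range(rows) for j in range(cols + 1))
    return vertices - (h_edges + v_edges) + faces


# Hypothetical excursion set: a ring enclosing one hole, plus an isolated pixel
excursion = [
    [1, 1, 1, 0, 0],
    [1, 0, 1, 0, 1],
    [1, 1, 1, 0, 0],
]
print(euler_characteristic(excursion))
```

Applying such a function to the field truncated at a sequence of *u* values gives an observed EC curve that can be set against the expected EC.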

### *15.2.3 Localized Anomaly Detection*

Further analysis of the excursion sets is focused on the size and location of the detected anomalies. The excursion set maps themselves can be examined to determine where the excursions are occurring. An extremely localized, yet very strong anomaly will be of interest. An anomaly with a much lower amplitude but greater spatial extent may also be of interest. The spatial extent (size) of any anomaly is defined relative to the spatial correlation length of the field in which it is detected. The size of the anomaly is expressed through truncation of the field at a threshold value and defining the size of the excursion regions above that threshold.

In general, the significance of any anomaly in a spatial field is a function of its amplitude (intensity or strength) and its spatial extent (size). The observed SPM is compared against a specified multivariate spatial random field with a defined correlation length that serves as the model of the null hypothesis for the differences between two ensembles of spatial fields. Truncation of the observed SPM at a given threshold level creates regions of excursions above that threshold and the significance of the number and size of these excursions relative to the model of the null hypothesis is calculated. As in classical statistical hypothesis testing, the *p*-value defines the chance that the observed anomaly would occur under the null hypothesis. Here, the focus is on identifying the largest region of excursion for a specified threshold and calculating the chances of that anomaly occurring under the null hypothesis.

The pre-processing steps and the approach used for application of statistical parametric mapping to detection of significant excursion sets are outlined here and these steps are then applied to an example problem. The focus is on the approach used for calculation of the probability that one or more regions of activation of a certain area, or larger, could have occurred by chance under the constructed mG model. The full development of this approach for medical imaging is provided by Friston et al. (1994) and Worsley et al. (1996). Additionally, Adler (2000) and Taylor and Adler (2003) provide further development of level crossing in random fields and the relationship to the Euler characteristic.

Steps:

	- a. Calculate the FWHM of the smoothed and transformed SPM created in Steps 1–3. The FWHM is derived from the variances and covariances of the spatial derivatives of the SPM. The resulting FWHM values are typically 5–15 times the size of the smoothing kernel used in Step 2.
	- b. Identify pixels that are above/below the ± threshold value.
	- c. Employ a flood-fill algorithm to determine the sizes of the separate regions of connected pixels, or regions of exceedance and label each region for both positive and negative excursions.
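Steps b and c can be sketched as follows. This is a minimal illustration, not the implementation used for the results in this chapter: the function name `label_excursions`, the toy map, and the choice of 4-connectivity (pixels sharing an edge) are all assumptions made for the example.

```python
from collections import deque

def label_excursions(spm, threshold):
    """Label 4-connected regions where spm > threshold (positive excursions).

    Returns (labels, sizes): labels is a grid of region ids (0 = background)
    and sizes maps each region id to its pixel count.
    """
    n_rows, n_cols = len(spm), len(spm[0])
    labels = [[0] * n_cols for _ in range(n_rows)]
    sizes = {}
    next_id = 0
    for r0 in range(n_rows):
        for c0 in range(n_cols):
            if spm[r0][c0] <= threshold or labels[r0][c0]:
                continue  # below threshold or already assigned to a region
            next_id += 1
            labels[r0][c0] = next_id
            queue = deque([(r0, c0)])
            count = 0
            while queue:  # breadth-first flood fill from the seed pixel
                r, c = queue.popleft()
                count += 1
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    rr, cc = r + dr, c + dc
                    if (0 <= rr < n_rows and 0 <= cc < n_cols
                            and spm[rr][cc] > threshold
                            and not labels[rr][cc]):
                        labels[rr][cc] = next_id
                        queue.append((rr, cc))
            sizes[next_id] = count
    return labels, sizes

# Toy Z-score map (hypothetical values).
spm = [
    [0.1, 2.7, 2.9, 0.0, -2.8],
    [0.3, 3.1, 0.2, 0.0, -2.6],
    [0.0, 0.1, 0.0, 2.6, 0.4],
]
u = 2.5
_, pos_sizes = label_excursions(spm, u)
# Negative excursions are found by negating the map and reusing the same code.
_, neg_sizes = label_excursions([[-z for z in row] for row in spm], u)
largest_pos = max(pos_sizes.values())
largest_neg = max(neg_sizes.values())
```

The largest positive and negative region sizes are the quantities whose significance is assessed next. An 8-connected neighbourhood would be an equally reasonable choice and would merge diagonally touching regions.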

The probability of the largest excursion region occurring under the null hypothesis of a Gaussian SPM with the calculated FWHM is then calculated. The significance of the maximum excursion size follows the methods of Friston et al. (1994):


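$$\begin{aligned} P(n_{\text{max}} \ge k) &= \sum_{i=1}^{\infty} P(m=i) \cdot \left[ 1 - P(n < k)^i \right] \\ &= 1 - e^{-E[m] \cdot P(n \ge k)} \\ &= 1 - \exp\left(-E[m] \cdot e^{-\beta k^{2/D}}\right) \end{aligned}$$

where *m* is the number of excursion regions, *n* is the number of pixels in a region, *N* is the total number of pixels above the threshold, *β* = [Γ(*D*/2 + 1) · *E*[*m*]/*E*[*N*]]<sup>2/*D*</sup> and *D* is the dimension of the domain.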
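The final expression can be evaluated directly once *E*[*m*] and *E*[*N*] are known. The sketch below assumes the Gaussian-field expressions given by Friston et al. (1994), which are not restated in this chapter: *E*[*N*] = *S*·Φ(−*u*) and *E*[*m*] = *S*·(2π)<sup>−(*D*+1)/2</sup>·*W*<sup>−*D*</sup>·*u*<sup>*D*−1</sup>·exp(−*u*<sup>2</sup>/2), with *S* the number of pixels and *W* = FWHM/√(4 ln 2). The function name and the threshold *u* = 3.0 are illustrative choices only.

```python
import math

def p_max_excursion(k, u, fwhm, n_pixels, dim=2):
    """P(n_max >= k) for a stationary Gaussian field thresholded at u.

    E[N] and E[m] use the expressions assumed in the lead-in
    (after Friston et al. 1994); fwhm and k are in pixels.
    """
    w = fwhm / math.sqrt(4.0 * math.log(2.0))  # smoothness parameter W
    # Expected number of pixels above u: S * Phi(-u)
    e_n = n_pixels * 0.5 * math.erfc(u / math.sqrt(2.0))
    # Expected number of excursion regions above u
    e_m = (n_pixels * (2.0 * math.pi) ** (-(dim + 1) / 2.0)
           * w ** (-dim) * u ** (dim - 1) * math.exp(-0.5 * u * u))
    beta = (math.gamma(dim / 2.0 + 1.0) * e_m / e_n) ** (2.0 / dim)
    # 1 - exp(-x), written with expm1 to keep precision when x is tiny
    return -math.expm1(-e_m * math.exp(-beta * k ** (2.0 / dim)))

# Same field size and FWHM values as Fig. 15.2; the threshold is hypothetical.
p_smooth = p_max_excursion(k=60, u=3.0, fwhm=9.0, n_pixels=500 * 500)
p_rough = p_max_excursion(k=60, u=3.0, fwhm=3.0, n_pixels=500 * 500)
```

For a fixed region size and threshold, the probability is many orders of magnitude larger for the more strongly correlated field, which is the behaviour the correlation length of the null model is expected to produce.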

Calculations of *P*(*n*<sub>max</sub> ≥ *k*) within the (*k*, *u*) parameter space for spatial fields with two different correlation lengths (FWHM) are shown in Fig. 15.2. The role of the correlation length of the null hypothesis model is clear from Fig. 15.2 where the probability of an excursion region of 60 pixels or more is approximately 0.001 for a field with a FWHM of 9.0, but is essentially zero (∼ 10 × 10<sup>−12</sup>) for a field with a FWHM of 3.0.

**Fig. 15.2** P(nmax ≥ k) as a function of size of the excursion region, k, and the truncation threshold, u, for fields of size 500 × 500 with an isotropic FWHM of 3.0 pixels (left) and 9.0 pixels (right). The color scale is log10(P(nmax ≥ k))

### **15.3 Example Problems**

Two example problems are used here to demonstrate the calculations and application of SPM to detecting anomalies in spatial random fields. Both example problems are two-dimensional, but the same approaches are applicable to anomaly detection in 1-D and 3-D domains.

### *15.3.1 Anomaly Detection in Images*

A simple simulation study designed to mimic the detection of anomalous regions in either remote sensing or geophysical imagery is used here to test a few of the SPM calculations. The focus is on identifying the largest anomaly above a specified threshold and the significance of that anomaly.

A multiGaussian field is created through geostatistical simulation. The field is comprised of square, 5 × 5 m pixels, and has an isotropic Gaussian variogram with a range of 150 pixels. The field is created in standard normal space, *N*(0, 1) and the simulated values serve as the observed image. Measurement noise is added to the image by considering the simulated realization value, *z(x)* to be the mean value of a local Gaussian distribution at every pixel. The standard deviation of the Gaussian at every pixel, σz(x), is set to 2.0 and a Gaussian random deviate is drawn and added to *z(x)* to create the final image. This measurement noise is added independently at every pixel (i.i.d.) and then smoothed prior to adding to the observed image. The amount of spatial smoothing of the noise term is varied and the impact on anomaly detection is examined.

Anomalies are added to the observed image within a circular region having a radius of 40 pixels and centered at the center of the image. Background values within the anomaly region are multiplied by 1.5, creating stronger negative and positive values within the region depending on the sign of the original observed values. The area of the anomaly region is 5027 pixels.

Figure 15.3 shows background images (left column) at two levels of noise smoothing and the background images with the anomalies added (right column). As would occur in any image capture process, the noise values added to each image are drawn randomly and independently from any other image prior to smoothing. This creates subtle differences between the images in each row of Fig. 15.3 even without the addition of anomalies. Detection of the presence of the anomalies through visual comparison of the left and right images in each row of Fig. 15.3 is not obvious, even when the location of the anomaly is known.

The SPMs are calculated through a pixelwise *t*-test for comparing two means (Appendix) between the image with and without the anomalies. These *t*-statistic maps are transformed to Gaussian *Z* maps that are the SPM (Fig. 15.4). The large anomalies in the center of the image are readily seen along with the dramatic changes in the results due to the increased spatial correlation of the noise component with increased smoothing. Additional SPMs are created at intermediate levels of smoothing but are not shown here. Results from all levels of noise smoothing are shown in Table 15.1.

**Fig. 15.3** Background fields without (left column) and with (right column) added anomalies with a smoothing kernel size σ = 1.5 pixels (top row) and σ = 7.5 pixels (bottom row). Color scale units are arbitrary in this example

**Fig. 15.4** SPMs for the case of smoothing with a filter bandwidth of σ = 1.5 (left) and σ = 7.5 pixels (right). The color scale is in standard deviations away from the mean of zero

**Table 15.1** Results of SPM analysis for four levels of noise smoothing

A threshold of ±2.5 standard deviations is applied to the SPMs and the excursion regions for the two extreme levels of noise smoothing are shown in Fig. 15.5. There are over 200 positive and 200 negative excursions for the smallest amount of noise smoothing and only 1 positive and 1 negative excursion at the largest amount of smoothing. The size of the excursions that are due to the added anomalies clearly stands out in the left image of Fig. 15.5. Table 15.1 also shows how the maximum and minimum values in the SPM decrease in magnitude with increased levels of noise smoothing.

With increased smoothing of the noise, the FWHM of the image increases from 5.0 to ∼ 72 pixels (Table 15.1). While the size of the largest positive and negative excursions remains approximately constant near 2000 and 3600 pixels, respectively, the *p*-value for excursions of that size occurring in the image changes dramatically. At the lowest level of smoothing, the chances of getting excursions of size 2061 or 3640 pixels under the Gaussian random field model with a FWHM of 5.0 are essentially zero (< 1.0 × 10<sup>−16</sup>). However, excursion regions of a similar size under greater smoothing of the noise and a FWHM of 71.6 pixels are relatively common, with probabilities of 40 and 20%, respectively. These results demonstrate the strong dependence of *P*(*n*<sub>max</sub> ≥ *k*) on the spatial correlation of the field.

### *15.3.2 Ground Water Pumping*

**Fig. 15.5** Regions of excursion below a threshold of −2.5 (left column) and above 2.5 (right column) for images with noise smoothed using a filter of σ = 1.5 (top row) and σ = 7.5 (bottom row)

A general problem in a number of geoscience disciplines is the case where an ensemble of inputs is used in a calculation to provide a probabilistic result to a particular question. The calculation can be relatively simple or complex, but acts as a transfer function to transfer uncertainty in spatially distributed physical properties to uncertainty in an outcome of interest. Examples include groundwater models transferring uncertainty in hydraulic conductivity and recharge into radionuclide transport times; reservoir simulators transferring uncertainty in permeability and porosity into estimated recoverable oil; and simple spatial integration to transfer uncertainty in soil nutrient levels into estimating total crop yield for an agricultural field.

Here, a ground water example problem is used with the SPM approach to detect significant differences between two groups of an ensemble of spatial random fields of transmissivity. The ensemble is split into groups that create high results and all others. The SPM approach is used here to identify statistically significant features within the ensemble of input fields responsible for the specific results. This approach can be considered identification of the significant features in the random fields responsible for a specific result of a process that integrates across the entire field.

### **15.3.2.1 Problem Setup**

The ground water problem is motivated by the regulatory issue of impacts on a nearby wetland due to pumping from a planned water supply well. Well test criteria dictate that the pressure drop (drawdown) at a location 353 m to the northwest of the pumping well must be <2.00 m after pumping at a rate of 250 m<sup>3</sup>/h for 48 h. To simulate the aquifer test, a 12 × 12 km square domain, with zero-flux boundaries on the north and south and constant-head boundaries on the east and west, is defined. Prior to pumping, the constant-head boundaries create steady state flow across the domain. A constant transmissivity, *T*, of 10.0 m<sup>2</sup>/h is assumed across the majority of the domain. This constant value is replaced by a heterogeneous *T* field within the center of the domain. The heterogeneous field is 3500 × 3500 m with 5 × 5 m cells. A large pumping well is set in the center of the domain.

The aquifer is confined in this area and the mean and spatial covariance of the transmissivity can be estimated from other studies in aquifers of similar age and depositional history. The log10 values of transmissivity within the heterogeneous domain are simulated as a multiGaussian field with an isotropic Gaussian variogram with range 250 m and nugget of 5% of the sill. Transmissivity at the well location is considered known and provides the only conditioning point within the domain. A total of 200 realizations are created, and the 2D, confined, transient ground water flow equation is solved using finite differences on each realization:

$$\frac{\partial h(x, y)}{\partial t} = \frac{1}{S(x, y)} \left[ \frac{\partial}{\partial x}\left( T(x, y) \frac{\partial h}{\partial x} \right) + \frac{\partial}{\partial y}\left( T(x, y) \frac{\partial h}{\partial y} \right) \pm Q(x, y) \right],$$

where (*x*, *y*) indicates the spatial location, *h* (L) is the head (pressure), *t* is time and *Q* (L<sup>3</sup>/T) represents sources or sinks, here the pumping rate at the well. Transmissivity, *T* (L<sup>2</sup>/T), is spatially heterogeneous within the central domain and for the calculations here, storativity, *S* (−), is set to a single value of 1.0 × 10<sup>−5</sup> across all locations in the aquifer. The initial conditions for the transient simulation are taken from a steady state head solution using the same input *T* field. Three example transmissivity realizations and maps of the resulting drawdowns after 48 h of pumping are shown in Fig. 15.6. Figure 15.6 demonstrates that the heterogeneous *T* field strongly impacts the resulting pressure response in a non-linear manner.

#### **15.3.2.2 Results**

**Fig. 15.6** Three example transmissivity fields (left column) and the corresponding ground water drawdown levels after 48 h of pumping (right column). The color scales define log10 *T* in m<sup>2</sup>/h and log10 drawdown in meters

For each ground water simulation, the drawdown at the test location (353 m NW of the pumping well) at 48 h is recorded and compared to the regulatory limit, *R*, of 2.00 m. The *T* realization is placed into one of two classes: those that meet the pressure drop limit, drawdown ≤ *R*, and those that exceed the limit. After 200 ground water simulations, the pixelwise mean and standard deviation within each class are calculated (Fig. 15.7). These four maps provide the input to a two-sample *t*-test to determine the difference between two means. The resulting map of *t*-statistics is the SPM. Here the *t*-statistics are smoothed with a Gaussian kernel and transformed to *Z*-statistics and the *Z*-score SPM is shown in Fig. 15.8.
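The classification and the four summary maps can be sketched as below. The names (`class_maps`, `"meets"`, `"exceeds"`), the tiny flattened fields and the drawdown values are hypothetical and serve only to show the bookkeeping; the flow simulation that produces the drawdowns is not reproduced.

```python
from statistics import mean, stdev

def class_maps(fields, drawdowns, limit):
    """Split realizations by drawdown <= limit versus > limit and return,
    for each class, the count plus pixelwise mean and standard deviation."""
    groups = {"meets": [], "exceeds": []}
    for field, drawdown in zip(fields, drawdowns):
        key = "meets" if drawdown <= limit else "exceeds"
        groups[key].append(field)
    maps = {}
    for key, members in groups.items():
        per_pixel = list(zip(*members))  # values of one pixel across the class
        maps[key] = {
            "n": len(members),
            "mean": [mean(values) for values in per_pixel],
            "sd": [stdev(values) for values in per_pixel],  # needs n >= 2
        }
    return maps

# Four hypothetical log10(T) realizations, each flattened to three pixels.
fields = [
    [1.0, 0.9, 1.1],
    [1.2, 1.1, 0.9],
    [0.6, 1.0, 1.0],
    [0.4, 1.2, 1.0],
]
drawdowns = [1.6, 1.9, 2.3, 2.7]  # hypothetical drawdowns at the test point, m
maps = class_maps(fields, drawdowns, limit=2.00)
```

The two mean maps, the two standard deviation maps and the class counts are exactly the inputs required by the two-sample *t*-test described in the Appendix.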

**Fig. 15.7** The mean (top row) and standard deviation (bottom row) fields for the transmissivity realizations that create drawdown ≤ 2.00 m (left column) and drawdown > 2.00 m (right column). The color scales show log10(*T*) in m<sup>2</sup>/h

**Fig. 15.8** SPM for the difference between realizations. Full field is shown on the left and a zoomed in view of the central field on the right. Color scale is in standard deviations away from the mean of zero

**Fig. 15.9** Regions of excursion below a threshold of −2.5 (left) and above 2.5 (right)

The SPM is calculated at every pixel as the mean *T* value of the fields that created drawdowns exceeding the regulatory threshold, *R*, minus the mean *T* value of those that resulted in drawdowns less than or equal to the threshold: *T*<sub>&gt;*R*</sub> − *T*<sub>≤*R*</sub>. This convention creates a positive value in the SPM in an area where higher *T* values are associated with realizations that created exceedance of *R* and negative values where higher *T* values created drawdowns ≤ *R*. Figure 15.8 shows regions of positive and negative values, but the dominant anomaly is a high SPM value between the pumping well and the observation point to the northwest. For this example, 155 realizations (77.5%) created drawdowns ≤ *R* and 45 (22.5%) created drawdowns that exceeded *R*.

The SPM is truncated at a threshold of ±2.5σ and the excursion regions are defined (Fig. 15.9). The size of the largest excursions and the probability of them occurring under the mG model are shown in Table 15.2. The SPM has a FWHM of 111.5 m (22.3 pixels). The large positive excursion between the pumping well and the monitoring point is significant with a *p*-value near 1.0 × 10<sup>−4</sup> while the largest negative excursion is not.

Here the SPM approach also serves as a means of determining the regions of increased sensitivity of drawdown to the *T* values. As expected, when viewed from the perspective of influencing extreme drawdown values, the *T* values in the area between the pumping well and the monitoring point are significantly more important than other values in the *T* field. The remaining regions of excursion do not have any readily discernible connection to the ground water flow dynamics and are consistent with expected excursions in a mG field with this amount of correlation. In practice, the large positive excursion region in the SPM can be used to focus resources for additional data collection, e.g., geophysical survey and/or additional wells.

### **15.4 Summary**

There is a large amount of work reported in the functional MRI literature on the detection of anomalies in spatially correlated fields using SPM. Apart from some work in astrophysics, this SPM work has generally been restricted to medical imaging. The body of knowledge around SPM and the statistical approaches developed for fMRI can be readily applied to problems in the earth and environmental sciences. This chapter reviews some of the major developments from the fMRI literature and demonstrates their application with an image anomaly detection problem and a ground water modelling problem. A strong advantage of SPM is that it directly addresses the challenge of enabling hypothesis testing, including calculation of the significance of the results, in spatially correlated fields.

The example problems chosen here emphasized defining the significance of the largest, positive and negative, anomaly in each SPM. The SPM framework also supports hypothesis testing on non-localized, "omnibus", features such as the maximum/minimum value of the SPM, the number of pixels exceeding the threshold and the number of excursion regions within the SPM. Additionally, hypothesis testing of localized, "focal", features is also supported, including hypothesis testing of the occurrence of any size excursion.

The example problems used here relied on the underlying images being realizations of mG fields, but that is not a requirement. It is the map of the test statistic values defining the differences between fields that is modelled as a mG field, and that flexibility makes SPM applicable to a very general set of problems as the mG model is a standard for differences between images. For example, the same approach could be used to compare geologic models with discrete features. Future work will consider the application of other statistical tests within the SPM framework.

### **Appendix: Conditional Differences**

The *t*-test is a traditional measure of the difference between two means (e.g., Walpole and Myers 1989). Quite simply, the *t*-statistic is the difference between two values, at least one of which is a population or sample mean, normalized by the standard error of the mean:

$$t = \frac{\overline{X} - \mu}{s_e} = \frac{\overline{X} - \mu}{s\sqrt{1/n}}$$

where $\overline{X}$ is a sample mean, *μ* is a population mean, *s<sub>e</sub>* is the standard error of the mean, which is the standard deviation of the observations, *s*, that make up the data vector *X* multiplied by the square root of 1 over the number of samples, *n*, within *X*. The cumulative probabilities for any value of *t* are available from the Student's *t* distribution and require knowledge of the degrees of freedom, *ν*, within the test. For the analyses done here, *ν* is generally *n* − 1.

In the case of comparing two sample means to each other at each location, i.e., *A*(*x*, *y*) and *B*(*x*, *y*), instead of comparing a sample mean to a theoretical population mean, the value of *s<sub>e</sub>* must be calculated from both sample sets as:

$$s_e = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$$

where *n*<sub>1</sub> and *n*<sub>2</sub> are the number of images that were used in calculating the average maps *A* and *B* and *s<sub>p</sub>* is the pooled standard deviation:

$$s_p(x, y) = \sqrt{\frac{(n_1 - 1)s_1^2(x, y) + (n_2 - 1)s_2^2(x, y)}{n_1 + n_2 - 2}}$$

Here we are assuming that *n*<sub>1</sub> and *n*<sub>2</sub> are constant for all locations and therefore not a function of (*x*, *y*). With Δ(*x*, *y*) = *A*(*x*, *y*) − *B*(*x*, *y*) denoting the difference between the two average maps, the *t*-statistic image (map), based on the pooled standard deviation, is:

$$t(x, y) = \frac{\Delta(x, y)}{s_p(x, y)\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$$
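As a worked illustration of the pooled two-sample statistic, the sketch below evaluates it pixel by pixel on flattened maps. The function name and the map values are hypothetical; the class sizes of 45 and 155 are borrowed from the ground water example, and the difference is taken as the first mean map minus the second, which is an assumption about the sign convention.

```python
import math

def pooled_t_map(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Pixelwise two-sample t-statistic using the pooled standard deviation."""
    t_map = []
    for m_a, s_a, m_b, s_b in zip(mean_a, sd_a, mean_b, sd_b):
        # Pooled standard deviation at this pixel
        s_p = math.sqrt(((n_a - 1) * s_a ** 2 + (n_b - 1) * s_b ** 2)
                        / (n_a + n_b - 2))
        # Standard error of the difference between the two means
        s_e = s_p * math.sqrt(1.0 / n_a + 1.0 / n_b)
        t_map.append((m_a - m_b) / s_e)
    return t_map

# Two-pixel toy maps: the first pixel differs between classes, the second does not.
t_map = pooled_t_map(mean_a=[1.5, 1.0], sd_a=[0.5, 0.5], n_a=45,
                     mean_b=[1.0, 1.0], sd_b=[0.5, 0.5], n_b=155)
degrees_of_freedom = 45 + 155 - 2
```

Each value of the resulting map would then be referred to a Student's *t* distribution with *n*<sub>1</sub> + *n*<sub>2</sub> − 2 degrees of freedom before conversion to a *Z*-score.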

### **References**

Adler RJ, Hasofer AM (1976) Level crossings for random fields. Ann Probab 4:1–12

Adler RJ (1981) The geometry of random fields. Wiley, New York



### **Chapter 16 Water Chemistry: Are New Challenges Possible from CoDA (Compositional Data Analysis) Point of View?**

**Antonella Buccianti**

**Abstract** John Aitchison died in December 2016 leaving behind an important inheritance: to continue to explore the fascinating world of compositional data. However, notwithstanding the progress that we have made in this field of investigation and the diffusion of the CoDA theory in different research fields, much work remains to be done, particularly in geochemistry. In fact, most of the papers published in international journals that manage compositional data ignore their nature and their consequent peculiar statistical properties. On the other hand, when CoDA principles are applied, several efforts are often made to continue to consider the log-ratio transformed variables, for example the centered log-ratio ones, as the original ones, demonstrating a sort of resistance to thinking in relative terms. This appears to be a very strange behavior since geochemists are used to ratios and their analysis is the basis of experimental calibration when standards are used to calibrate the instruments. In this chapter some challenges are presented by exploring water chemistry data with the aim of inviting people to capture the essence of thinking in a relative and multivariate way, since this is the path to obtain a description of natural processes that is as complete as possible.

### **16.1 Water Chemistry Data as Compositional Data**

A. Buccianti (✉)
Department of Earth Sciences, University of Florence, Via G. La Pira 4, 50121 Florence, Italy e-mail: antonella.buccianti@unifi.it

A. Buccianti CNR-IGG, Unit of Florence, Florence, Italy

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_16

When geochemical data are analysed by using statistical methods, several units can be used to express concentrations and a first discussion of their compositional nature is reported in Buccianti and Pawlowsky-Glahn (2005). The usual units of measurement include milligrams per liter (mg/L), parts per million by weight (ppm), parts per billion by weight (ppb), millimole per liter (mmol/L), and milliequivalent per liter (meq/L). The ppm and mg/L units are numerically equal if the density of the water sample is 1 g/cm<sup>3</sup>, as in pure water. Samples can be converted from mg/L to ppm by dividing each component by the density of the water sample. The term mmol/L indicates the number of ions or molecules in the water when multiplied by Avogadro's number (the number of molecules in a mole of material, 6.022 × 10<sup>23</sup>). The measure mg/L is converted to mmol/L by dividing by the atomic or molecular weight. To express concentration in meq/L (electrical charges are considered), mmol/L is multiplied by the charge of the ions. In each case the basis of the calculation is the content of some chemical species referred to a given weight or volume, then multiplied by a constant (atomic or molecular weight, electrical charges).
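The chain of conversions from mg/L to mmol/L to meq/L amounts to a division and a multiplication by constants, as the following sketch shows. The function names and the sample concentration are hypothetical; the molar mass of calcium (40.078 g/mol) and its charge (+2) are standard values.

```python
def mg_per_l_to_mmol_per_l(mg_per_l, molar_mass_g_per_mol):
    """Convert a mass concentration to a molar concentration."""
    return mg_per_l / molar_mass_g_per_mol

def mmol_per_l_to_meq_per_l(mmol_per_l, charge):
    """Convert a molar concentration to equivalents using the ionic charge."""
    return mmol_per_l * abs(charge)

CA_MOLAR_MASS = 40.078   # g/mol, calcium
CA_CHARGE = 2            # Ca2+

ca_mg_per_l = 80.156     # hypothetical sample value
ca_mmol_per_l = mg_per_l_to_mmol_per_l(ca_mg_per_l, CA_MOLAR_MASS)
ca_meq_per_l = mmol_per_l_to_meq_per_l(ca_mmol_per_l, CA_CHARGE)
```

Because every conversion only rescales a part by a constant, the choice of unit changes the numerical values of the ratios between parts by fixed factors but not the relative information they carry.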

These types of data describe parts of some whole and even if proportions are expressed as real numbers, they cannot be interpreted, or even analysed, as real data. It is well known that this practice can lead to paradoxes and/or misinterpretations (e.g. intervals covering negative proportions, spurious correlations) already discussed a century ago (Pearson 1897), but mostly forgotten and neglected over the years (Chayes 1960).

There is no other way to compare different samples from dissimilar sites and times, as is usually required. Thus the compositional nature of the experimental data is an intrinsic property related to their origin (e.g. instrument calibration) and to the necessity of making comparisons to investigate the genesis of environmental variability. Like directional (circular) observations (Fisher 1995), compositional data lie in a constrained sample space called the *simplex* (Aitchison 1986):

$$S^{D} = \left\{ \mathbf{x} = [x_{1}, x_{2}, \dots, x_{D}] \;\middle|\; x_{i} > 0,\ i = 1, 2, \dots, D;\ \sum_{i=1}^{D} x_{i} = \kappa \right\} \tag{16.1}$$

where the *D* components of the vector **x** are called parts (variables) of the composition. The value of *κ* depends on the units of measurement or the rescaling procedure, and usual values are 1 (proportions), 100 (%), 10<sup>6</sup> (ppm) or similar. Note that it is not necessary to have ∑<sub>*i*=1</sub><sup>*D*</sup> *x<sub>i</sub>* = *κ* (closed data) to obtain compositional observations. In fact, a (row) vector **x** = [*x*<sub>1</sub>, *x*<sub>2</sub>, ..., *x<sub>D</sub>*] is a *D*-part composition when all its components are strictly positive real numbers and carry only *relative information*. This means that the message about what is occurring is mainly contained in the ratios between the parts since the numerical value of each variable by itself is not relevant. A recent thorough analysis of the "compositional problem" can be found in Pawlowsky-Glahn and Buccianti (2011) and Pawlowsky-Glahn et al. (2015). Interesting applications to water chemistry can also be found in the literature (e.g. Engle and Rowan 2013, 2014; Engle and Blondes 2014; Buccianti and Zuo 2016; Owen et al. 2016; Buccianti et al. 2018; Shelton et al. 2018), where the potential of the family of log-ratio transformations is exploited in different ways, placing the relativity of the values and the multivariate vision at the center of the analysis. The cited papers are not exhaustive but have been chosen since they successfully focus on the use of the isometric log-ratio transformation as a way to describe the dynamics of geochemical processes.
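The statement that a composition carries only relative information can be checked numerically: rescaling a vector to any constant sum *κ* leaves every ratio between parts unchanged. The sketch below is illustrative only; the function name `closure` and the four-part sample are assumptions.

```python
def closure(parts, kappa=1.0):
    """Rescale strictly positive parts so that they sum to kappa."""
    total = sum(parts)
    return [kappa * p / total for p in parts]

# Hypothetical four-part water sample in mg/L (for instance Ca, Mg, Na, K).
sample_mg_per_l = [48.0, 12.0, 30.0, 6.0]

as_proportions = closure(sample_mg_per_l)            # kappa = 1
as_percent = closure(sample_mg_per_l, kappa=100.0)   # kappa = 100

# The ratio between the first two parts is the same before and after closure.
ratio_raw = sample_mg_per_l[0] / sample_mg_per_l[1]
ratio_closed = as_percent[0] / as_percent[1]
```

The raw and closed vectors therefore represent the same composition, which is why closure to a constant sum is not required for data to be compositional.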

### **16.2 Isometric-Log Ratio Transformation: Is This the Key to Decipher the Dynamics of Geochemical Systems?**

### *16.2.1 Coordinates as Balances*

Water present below the land surface and running above it tells the history of the environment with which it has been in contact. Rainfall and snowmelt interact with the rock of the Earth's surface and percolate through the soil zone where chemical reactions with gases, minerals and organic compounds take place. Chemical reactions occur because the composition of the water is not in equilibrium with the solid phases or the gaseous component (Kleidon 2010). Thus disequilibrium drives the reactions and solutes in the water are derived from the dissolution or leaching of the solid phases and from the dissolution of gases from the air or from the oxidation of organic matter. Most natural systems are open and, according to Nicolis and Prigogine (1989), they are characterized by dissipative structures and the presence of irreversible processes. Dissipative structures contain subsystems, which permanently fluctuate until the fluctuation becomes so strong that it breaks the original system to generate a new condition, more complex and characterized by a higher level of order. The dynamics of systems far from equilibrium requires continuous self-organization and, to maintain this condition, the energy flux from the environment is higher than required for the initial state and irreversible processes can be a source of order rather than chaos. Most geological systems are open and dynamic, are characterized by a great number of components and develop in a nonlinear way far from equilibrium (Shvartsev 2009). Particularly interesting from this point of view is the water-rock system, where synergetic properties can also be found, in contrast to thermodynamic equilibrium where elements (molecules) behave independently of one another (Shvartsev 2013).

The use of the isometric log-ratio coordinates (Egozcue et al. 2003) not only allows us to manage compositional data with classical statistical tools, but also could offer a powerful tool to probe the level of self-organization of a geochemical system as a whole. When coordinates are obtained by using the sequential binary partition method (Egozcue and Pawlowsky-Glahn 2005), guided by a geochemical criterion, the analysis of their frequency distribution may represent an interesting way to understand the laws governing randomness and variability. By taking into account this consideration, an improvement of the *balance dendrogram* (Pawlowsky-Glahn and Egozcue 2001) is here presented with the aim to investigate the behavior of aqueous systems.

The sample space of *D*-part compositional data, the simplex, being a subset of the real space *R<sup>D</sup>*, has a real Euclidean vector space structure (Billheimer et al. 2001; Pawlowsky-Glahn and Egozcue 2001; Buccianti and Magli 2011). This situation allows the representation of data in coordinates with respect to an orthonormal basis, for example following the Gram-Schmidt orthonormalization process or a Singular Value Decomposition (Egozcue et al. 2003). Since these methods often yield coordinates that are not easy to interpret, *balances*, a specific type of orthonormal coordinates associated with groups of parts, have been proposed (Egozcue and Pawlowsky-Glahn 2005). This method is based on a sequential binary partition of a *D*-part composition into non-overlapping groups and, when the procedure is geochemically guided, it leads to coordinates that are easy to interpret. Moreover, it allows understanding of how the total variance is decomposed into marginal variances, thus pointing out the relationship between intra-group and inter-group compositional parts variability. For the *i*-th order of partition, the balance is

$$b_{i} = \sqrt{\frac{r_{i} \cdot s_{i}}{r_{i} + s_{i}}} \log \frac{\left(\prod_{x_{j} \in G_{i1}} x_{j}\right)^{1/r_{i}}}{\left(\prod_{x_{l} \in G_{i2}} x_{l}\right)^{1/s_{i}}} \tag{16.2}$$

where *r<sub>i</sub>* and *s<sub>i</sub>* are the number of parts in the groups of the numerator (*G*<sub>*i*1</sub>) and denominator (*G*<sub>*i*2</sub>), respectively. As we can see, the balance is defined as the natural logarithm of the ratio of geometric means of the parts in each group, normalized by the coefficient needed to obtain unit length of the vectors of the basis.
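A single balance of Eq. 16.2 can be computed as in the sketch below, which works with logarithms of the parts so that the geometric means become arithmetic means of logs. The function name, the index-based grouping and the three-part sample are assumptions made for illustration.

```python
import math

def balance(x, numerator_idx, denominator_idx):
    """Balance between two non-overlapping groups of parts of composition x."""
    r, s = len(numerator_idx), len(denominator_idx)
    # Log of the geometric mean of each group
    log_gmean_num = sum(math.log(x[j]) for j in numerator_idx) / r
    log_gmean_den = sum(math.log(x[l]) for l in denominator_idx) / s
    # Normalizing coefficient times the log-ratio of geometric means
    return math.sqrt(r * s / (r + s)) * (log_gmean_num - log_gmean_den)

# Hypothetical three-part sample (for instance Ca, Mg, Na) in mg/L.
x = [40.0, 10.0, 20.0]

# First partition: {Ca, Mg} against {Na}; second: Ca against Mg.
b1 = balance(x, numerator_idx=[0, 1], denominator_idx=[2])
b2 = balance(x, numerator_idx=[0], denominator_idx=[1])

# Rescaling the whole sample (a change of units or closure) leaves b1 unchanged.
b1_rescaled = balance([10.0 * v for v in x], [0, 1], [2])
```

In this sample the geometric mean of the first group equals the value of the third part, so the first balance is zero, while the second balance is positive because the first part exceeds the second.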

### *16.2.2 Behavior of Self-organizing Systems and CoDA Philosophy*

A general characteristic of self-organizing systems is robustness and resilience (Dakos et al. 2014; Dai et al. 2015). This means that they are relatively insensitive to perturbations or errors, and can show a strong capacity to restore themselves after changes (Scheffer et al. 2009, 2012). One reason for this fault-tolerance is the redundant, distributed organization so that the non-damaged regions can usually make up for the damaged ones. Within certain limits, another reason for the intrinsic robustness is that self-organization is facilitated by randomness, fluctuations or "noise" while the stabilizing effect of feedback loops guarantees resilience. The presence of feedback mechanisms generates systems that can be responsible for their own maintenance, and thus largely independent from the environment. Although in general there will still be exchange of matter and energy between systems and surroundings, the organization is determined purely internally. Thus the system is thermodynamically open, but organizationally closed. Organizational closure turns a collection of interacting elements into an individual, coherent whole. This whole has properties that arise out of its organization that can be described by the probability laws that govern the relative behaviour of its elements (van Rooij 2013). From this point of view CoDA theory appears to capture the philosophy of this condition and the analysis of the shape of the frequency distribution of isometric coordinates should be the adequate tool (Allegre and Lewin 1995; Seely et al. 2012; Holden and Rajaraman 2012; Buccianti and Zuo 2016).

As reported in Scheffer et al. (2012) the probability density distribution of some variables describing the state of a system can be used to estimate how the potential landscape is reflecting its stability properties. The shape of the probability density function indicates where the data are more aggregated and which laws are governing the variability, giving us fundamental information about the genesis of randomness (Agterberg 2014). In our case it will be the shape of the frequency distribution of isometric log-ratio coordinates representing some geochemical process that will inform us about dynamic properties of the system. In Fig. 16.1 some examples of a non-equilibrium dynamics are reported (Scheffer et al. 2009). Conditions represented in (a) are far from a bifurcation point. The pothole in the potential line corresponds to an area where data tend to aggregate in the density probability distribution function. Here resilience is large since the basin of attraction is wide and the rate of recovery from perturbations is relatively high. If the system is stochastically forced, the resulting dynamics will be characterised by low correlation between states at subsequent time intervals. In (b) the system is closer to the transition point and resilience decreases due to the shrinking of the attraction basin and the low rate of recovery from small perturbations. Here the slight depression could be related to presence of bimodality indicating presence of alternative states. In this case the system in a stochastic environment will have a long memory for perturbations and its dynamics will be governed by high variance and stronger correlations between subsequent states.

**Fig. 16.1** Example of non-equilibrium dynamics (from Scheffer et al. 2009, modified). The pothole in the potential line of diagram **a** corresponds to an area where data tend to aggregate in the probability density function. The slight depression in **b** could be related to bimodality, indicating the presence of alternative states (Scheffer et al. 2012)

### **16.3 Improving CoDA-Dendrogram: Checking for Variability, Resilience and Stability**

The chemical composition of groundwaters from the Arezzo basin aquifer (Tuscany, central Italy) was analysed, as an application example, to obtain information about the dynamics of the aqueous geochemical system. The Arezzo Basin (Fig. 16.2), formed since the Upper Pliocene, is a structural depression bordered to the North and to the East by the Pratomagno and Chianti belts, respectively, and to the South and to the East by two tectonic lineaments (Val d'Arbia-Val Marecchia transversal and Chitignano normal faults). Along these tectonic discontinuities CO2-rich manifestations either seep out or are exploited by private companies down to a depth of 1000 m. Three main aquifers are recognized: (i) a relatively deep aquifer hosted in Tertiary sandstone formations; (ii) an intermediate aquifer hosted in Quaternary fluvio-lacustrine sediments and (iii) a shallow aquifer in recent alluvial sediments. The available geochemical database consists of about 500 samples collected in different dry and rainy seasons in recent years from 80 wells distributed throughout the basin. The sampling depth is, unfortunately, not always known, and few differences can be related to seasonal changes. Physical parameters (temperature and electrical conductivity), major, minor and trace dissolved species (pH, Ca, Mg, Na, K, NH4, HCO3, SO4, NO3, NO2, Cl, Br, F and heavy metals), oxygen and hydrogen isotopes in the water molecules and dissolved gases (including 13C-CO2) were analyzed. On the basis of Total Dissolved Solids (TDS), the waters from the Arezzo aquifer can be considered mainly oligomineral and medium-mineral, whereas mineral waters are almost exclusively associated with

**Fig. 16.2** The hydrographic system of the Arezzo basin (Tuscany, central Italy) (http://sit.comune.arezzo.it/normativa/index.php?normativa=\_ps&mappa=ps\_b11a)

CO2-rich wells. From a classification point of view, Ca(Mg)-HCO3 is by far the most representative geochemical facies, followed by Na(K)-HCO3, Ca(Mg)-SO4 and Na(K)-Cl types. It is worth noting that the Na(K)-HCO3 waters, whose origin is related to the presence of CO2-rich waters that favor cation exchange processes with clay minerals contained in the sedimentary formations, are aligned along the Val d'Arbia-Val Marecchia transversal tectonic system.

Table 16.1 reports the sequential binary partition used to construct the isometric log-ratio coordinates. The first coordinate could represent the balance between the most important chemical reactions involving carbonate and silicate rocks (Ca<sup>2+</sup>, Mg<sup>2+</sup>, Na<sup>+</sup>, K<sup>+</sup>, HCO3<sup>−</sup> and H<sup>+</sup>) versus elements and chemical species whose sources could be different, including pollution (Cl<sup>−</sup>, SO4<sup>2−</sup>, NO3<sup>−</sup>). The second coordinate is an analysis inside the carbonate and silicate cycle, balancing cations and anions. The third compares the behaviour of the bivalent versus monovalent elements involved, while the fourth and the fifth compare their relative behaviour. The sixth coordinate analyses the anions, giving information about the pH conditions of the water. Finally, the remaining coordinates investigate the behaviour of variables whose source may be related to pollution. In the absence of atmospheric cyclic salts and evaporites, about 30% of Cl<sup>−</sup> is related to pollution, and 54% in the case of SO4<sup>2−</sup>, while for nitrate the most important anthropogenic sources are septic tanks, application of nitrogen-rich fertilizers to turf grass, and intensive agricultural processes (Berner and Berner 1996; Liu et al. 2011; Menció et al. 2016).
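
As a rough illustration of how one coordinate of a sequential binary partition can be computed, the sketch below evaluates a single balance as the normalised log-ratio of the geometric means of two groups of parts, which is the usual definition of an isometric log-ratio balance. The function name `balance`, the grouping and the concentration values are hypothetical and are not taken from Table 16.1; the chapter's own analysis was not carried out with this code.

```python
import math

def balance(numerator, denominator):
    """Return the ilr balance between two groups of strictly positive parts."""
    r, s = len(numerator), len(denominator)
    # The mean of the logs equals the log of the geometric mean.
    log_gm_num = sum(math.log(v) for v in numerator) / r
    log_gm_den = sum(math.log(v) for v in denominator) / s
    return math.sqrt(r * s / (r + s)) * (log_gm_num - log_gm_den)

# Hypothetical concentrations (same units in both groups), for illustration only.
weathering_parts = [60.0, 15.0, 20.0, 2.0]   # e.g. Ca, Mg, Na, K
other_parts = [25.0, 40.0, 10.0]             # e.g. Cl, SO4, NO3
first_balance = balance(weathering_parts, other_parts)
```

Because only ratios enter the formula, multiplying every part by the same constant leaves the balance unchanged, which is the property that makes such coordinates suitable for closed (compositional) data.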

As can be seen, the variance is higher for the first balance, which compares natural and anthropic processes, and for the last one, which compares SO4<sup>2−</sup> and NO3<sup>−</sup>, whose ratio variability is further evidence of numerous sources and fluctuations. A first result is that when elements are more closely related to natural weathering processes their balance variability appears to be reduced, probably indicating that the same processes have been working through time in a similar way. In light of the previous discussion about the dynamics of geochemical systems, more information should be obtained by analysing the frequency distribution of the balances.

To this end, Fig. 16.3 reports an improved version of the *balance dendrogram* in which the original boxplots (Pawlowsky-Glahn and Egozcue 2011) are paired with the frequency distribution of the coordinates. The histograms share the same horizontal and vertical scales, so they are comparable. The red line is the Gaussian distribution and the black dashed line the kernel density estimate.

Several normality tests indicate that in no case can the normal distribution be considered a model for the log-ratio coordinates; consequently, the log-normal model cannot be used to describe ratios between parts or groups of parts. In most cases this appears to be due to some bimodality or to the presence of a heavy tail in the right-hand part of the distribution. The presence of power laws is associated with complex systems composed of processes that interact to self-organize their behavior across multiple temporal and/or spatial scales. Both fractals and multifractals are commonly associated with local self-similarity or scale-independence, generally leading to power-law relations



**Fig. 16.3** Balance dendrogram (Thió-Henestrosa et al. 2008) with associated histograms. The red line corresponds to the Gaussian model, the black dashed line to the kernel density estimate. The length of the vertical bar represents the proportion of the sample total variance

(Agterberg 2014). On the other hand, the lognormal shape represents a special condition in which the interdependencies among processes are minimized or absent and repeated fragmentation (or dilution) dominates. As we can see in Fig. 16.3, heavy tails characterize coordinates that mainly balance weathering of silicates and carbonates (K<sup>+</sup>, Na<sup>+</sup>, Mg<sup>2+</sup>, Ca<sup>2+</sup>, H<sup>+</sup>, HCO3<sup>−</sup>) versus other environmental processes (NO3<sup>−</sup>, SO4<sup>2−</sup>, Cl<sup>−</sup>). Moreover, considering the internal partition of the previous balances, the K<sup>+</sup>/Na<sup>+</sup>, Mg<sup>2+</sup>/Ca<sup>2+</sup> and, in particular, NO3<sup>−</sup>/SO4<sup>2−</sup> ratios repeat this type of behavior.

The use of the complementary distribution function reveals the presence of power laws more clearly. In this plot, reported in Fig. 16.4, if *X* has a power law distribution the behavior of *Prob[X* ≥ *x]* on logarithmic axes will be a straight line (Mitzenmacher 2004). As we can see, linear models describe several portions of the curves well for all the coordinates. This condition points to multifractality, perhaps associated with the space-time heterogeneity of the aquifer structure. Here a sudden change in the number of data with given concentration values is expected, particularly for pollution processes (Agterberg 2014). The fractal dimension of the phenomena, related to the slope of the straight lines, indicates how much more often small differences between the data occur than large ones.
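
The diagnostic described above can be sketched as follows: build the empirical complementary distribution, take logarithms of both axes, and check how well a straight line fits. This is a minimal sketch assuming strictly positive values; the helper names `ccdf_points` and `ols_slope` are hypothetical, the synthetic Pareto sample merely stands in for real coordinates, and an ordinary least-squares slope on such a plot is a visual aid rather than a rigorous estimator of the exponent.

```python
import math
import random

def ccdf_points(values):
    """Return (log10 x, log10 P[X >= x]) pairs for strictly positive values."""
    xs = sorted(values)
    n = len(xs)
    # For the i-th smallest value (0-based), n - i observations are >= it.
    return [(math.log10(x), math.log10((n - i) / n)) for i, x in enumerate(xs)]

def ols_slope(points):
    """Ordinary least-squares slope of y on x."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points)
    sxx = sum((p[0] - mean_x) ** 2 for p in points)
    return sxy / sxx

# Synthetic heavy-tailed sample (Pareto with tail exponent alpha), illustration only.
rng = random.Random(0)
alpha = 2.0
sample = [rng.paretovariate(alpha) for _ in range(5000)]
slope = ols_slope(ccdf_points(sample))  # should lie close to -alpha
```

For a pure power law the fitted slope approximates the negative tail exponent; several straight segments with different slopes, as described for Fig. 16.4, would instead call for fitting each portion of the curve separately.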

On the whole, the aquifer system appears to be governed by interaction-dominant dynamics, but it does not present a clear multimodality (or bimodality) that could be associated with different states. Considering Fig. 16.1 and the information deduced from the shape of the frequency distributions (Figs. 16.3 and 16.4), the aquifer could be associated with a sufficient resilience and recovery state (Scheffer et al. 2009, 2012). Of note here is that the most important contribution to variability appears to be related to chemicals such as NO3<sup>−</sup> and SO4<sup>2−</sup>, suggesting the weight and the intermittency of the anthropic pressure. The multifractality revealed in Fig. 16.4 could indicate that in the dynamical system the energy

**Fig. 16.4** Complementary distribution function used to reveal the presence of power laws. If *X* has a power law distribution the behavior of *Prob[X* ≥ *x]* will be a straight line (Mitzenmacher 2004)

dissipation cannot be neglected and that extended areas (intervals) of low fluctuations alternating with small areas of extremely large fluctuations are to be expected. Moreover, the system as a whole is undergoing non-linear dissipation, with energy interchange across different scales.

### **16.4 Conclusions**

Since Garrels and Christ (1965), equilibrium in the water-rock system has usually been analysed through the application of thermodynamic methods. In this context, the statistical analysis of water concentrations, suitably transformed into isometric log-ratio coordinates, could be an effective approach to understanding where the randomness in nature comes from (Agterberg 2014) and whether equilibrium conditions are really encountered.

The frequency distribution of the ratios of the compositional parts of the Arezzo aquifer chemistry exhibits an overlap between log-normal and power-law probability distributions when silicate and carbonate weathering (K<sup>+</sup>, Na<sup>+</sup>, Mg<sup>2+</sup>, Ca<sup>2+</sup>, H<sup>+</sup>, HCO3<sup>−</sup>) is balanced versus other environmental processes (NO3<sup>−</sup>, SO4<sup>2−</sup>, Cl<sup>−</sup>). Similar results are obtained when the partition to generate new balances is applied to the previous groups of parts (NO3<sup>−</sup> versus SO4<sup>2−</sup>, K<sup>+</sup> versus Na<sup>+</sup> or Mg<sup>2+</sup> versus Ca<sup>2+</sup>). The result indicates a system subjected to nonlinear compositional changes due to feedback effects, attributable in a porous medium to changes in porosity that cause a remarkable change in permeability, in the pore-fluid flow and in the concentration of chemical species (Zhao 2014). Since thermodynamic equilibrium represents a homogeneous distribution of the parts, the results indicate that the system is able to create and maintain a given amount of gradient, generating heterogeneity. However, no clear multimodality is present, and for the span of time analysed here different steady states (basins of attraction for concentration values) have not yet clearly emerged. Thus, from a compositional point of view, the system could be characterised by sufficient resilience and rate of recovery from disturbances, since the dissipative behaviour appears to be able to absorb fluctuations. New progress could be made in this direction by exploiting the capacity of CoDA to capture the interdependence of concentration values, thus describing the water system and its surroundings as a whole, as they are in reality.

**Acknowledgements** This research was supported by the University of Florence (2016 funds) and by the GEOBASI project financed in 2015 by Tuscany Region through CNR-IGG. Frits Agterberg and B.S. Daya Sagar are warmly thanked for their support, as is IAMG for the 2003 Felix Chayes Prize and its constant support of my research activity. Lisa Merli helped with the English language.

### **References**



### **Chapter 17 Analysis of the United States Portion of the North American Soil Geochemical Landscapes Project—A Compositional Framework Approach**

### **E. C. Grunsky, L. J. Drew and D. B. Smith**

**Abstract** A multi-element soil geochemical survey was conducted over the conterminous United States from 2007 to 2010 in which 4,857 sites were sampled, representing a density of 1 site per approximately 1,600 km<sup>2</sup>. Following adjustments for censoring and dropping highly censored elements, a total of 41 elements were retained. A logcentred transform was applied to the data, followed by a principal component analysis. Using the 10 most dominant principal components for each layer (surface soil, A-horizon, C-horizon), random forest classification analysis reveals continental-scale spatial features that reflect bedrock source variability. Classification accuracies range from near zero to greater than 74% for 17 surface lithologies that have been mapped across the conterminous United States. Classification accuracy does not differ significantly between the surface layer and the A- and C-horizons. This approach confirms that the soil geochemistry across the conterminous United States retains the characteristics of the underlying geology regardless of the position in the soil profile.

E. C. Grunsky (✉) Department of Earth and Environmental Sciences, University of Waterloo, Waterloo, ON, Canada e-mail: egrunsky@gmail.com

E. C. Grunsky China University of Geosciences, Beijing, China

L. J. Drew United States Geological Survey, Reston, VA, USA

D. B. Smith United States Geological Survey, Denver, CO, USA

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_17

**Electronic supplementary material** The online version of this chapter (https://doi.org/10.1007/978-3-319-78999-6\_17) contains supplementary material, which is available to authorized users.

### **17.1 Introduction**

A continental-scale soil geochemical survey was conducted over the conterminous United States from 2007 to 2010 by the U.S. Geological Survey (Smith et al. 2011, 2012, 2013, 2014). The survey collected samples at 4857 sites (Fig. 17.1), representing a density of 1 site per approximately 1600 km<sup>2</sup>. The sampling protocol included, at each site, a sample from a depth of 0–5 cm (referred to as the surface soil for the remainder of this paper), a composite of the soil A horizon (the uppermost mineral soil), and a sample from the soil C horizon (generally the partially weathered parent material). If the top of the C horizon was at a depth greater than 1 m, a sample over a 20 cm interval was collected at a depth of approximately 1 m.

Studies on the geochemistry of two transects (east-west and north-south) across the United States and Canada, conducted as pilot studies in preparation for the continental-scale survey (Smith 2009; Smith et al. 2009) showed variability of soil geochemistry and mineralogy along both directions (Garrett 2009; Eberl and Smith 2009; Woodruff et al. 2009). As well, Drew et al. (2010) studied the two transects and demonstrated that the geochemical variability of soil is also closely associated with ecoregions (CEC 1997), which reflect continental scale features such as soil, landform, major vegetation types and climate. These studies indicate that the soil geochemistry is useful for mapping both geological and ecological domains.

Soil geochemistry, from a geological context, reflects a range of mineralogy, as a function of weathering of different parent materials, along with organic content due to biological activity. Ideally, soil geochemistry will represent underlying parent material and processes associated with the modification of those parent materials through comminution, weathering, ground water activity and biogenic processes. Grunsky et al. (2012, 2014) and Mueller and Grunsky (2016) demonstrated that the

**Fig. 17.1** Soil sample sites over the conterminous United States. Samples were taken at the (0–5) cm layer, the A- and C-horizons

geochemistry of lake sediment and glacial till in northern Canada can be used to predict the underlying lithologies. As part of the North American Soil Geochemical Landscapes Project (Smith et al. 2009), Grunsky et al. (2013) used soil geochemistry collected over the Maritime Provinces of Canada and the northeast United States to demonstrate that A-, B- and C-horizon soil geochemistry is useful for mapping the underlying lithologies. More recently, Grunsky et al. (2017) have shown that the geochemistry of surficial soils can identify and classify underlying crustal blocks across the Australian continent, even after extended periods of weathering, transport and reworking.

The approach is based on the use of training sets of representative lithologies. Unfortunately, there are no continental-scale lithologic maps or representative training sets that can be used for predictive bedrock lithologic mapping in Canada or the United States. Sayre et al. (2009) classified the land surface of the conterminous United States according to surficial materials lithology, terrestrial ecosystems and isobioclimate. Isobioclimatic zones were subdivided into thermotypes (temperature) and ombrotypes (moisture). It follows that soil geochemistry is a proxy for processes controlled by climatic factors. A key question that arises is whether any of these processes can be identified uniquely in the soil geochemistry and, if so, how they can be identified in terms of spatial continuity and distinctive chemistry. Drew et al. (2010) studied two transects across the US and demonstrated that the soil geochemistry is closely tied to zones that define the terrestrial ecosystems intersected by these transects. The objective of the current study is to address this question through the use of multivariate statistical analysis and Bayesian-based classification in conjunction with geostatistical methods that accurately describe processes in terms of distinctive geochemistry and spatial continuity.

### **17.2 Methods**

### *17.2.1 Sampling and Analysis*

The soil samples were analysed for geochemistry and mineralogy as described by Smith et al. (2011, 2012, 2013, 2014). The samples were air-dried and sieved to <2 mm, after which the material was crushed in a ceramic mill prior to chemical analysis. Concentrations of Ag, Al, Ba, Be, Bi, Ca, Cd, Ce, Co, Cr, Cs, Cu, Fe, Ga, In, K, La, Li, Mg, Mn, Mo, Na, Nb, Ni, P, Pb, Rb, S, Sb, Sc, Sn, Sr, Te, Th, Ti, Tl, U, V, W, Y, Zn in all the soil samples (14,434) were determined by near-total digestion with HCl-HNO3-HClO4-HF followed by inductively coupled plasma-mass spectrometry and inductively coupled plasma-atomic emission spectrometry. Mercury values were obtained using cold-vapor atomic absorption spectrometry following dissolution in a mixture of HCl and HNO3, and Se was determined by hydride-generation atomic absorption spectrometry (HGAAS) following dissolution in a mixture of HNO3, HF, and HClO4. Arsenic was also determined by HGAAS following fusion in a mixture of sodium peroxide and sodium hydroxide at 750 °C. Total carbon was determined by combustion. Smith et al. (2013) provide details on the analytical methods and quality control protocols. Silicon was not determined.

All A-horizon and C-horizon samples (9575) were analysed by X-ray diffraction, and the percentages of major mineral phases were calculated using a Rietveld refinement method. Splits of the <2 mm fraction were used for analysis. Complete details of the technique and quality control protocols are provided in Smith et al. (2013).

### *17.2.2 Data Screening and the Compositional Nature of Geochemical Data*

Geochemical analyses require screening and adjustment prior to any application of statistical methods and interpretation. A generalized sequence of data screening and adjustment strategies is documented in Grunsky (2010). The data were evaluated and analysed using the R programming and statistical environment (R Core Team 2013).

Major element concentrations, reported as percentages, were converted to ppm by multiplying the values by a factor of 10,000. Summary statistics for the data are given in Smith et al. (2013). The data were screened to determine the number of values that were reported at less than the lower limit of detection. Data that are reported at less than the lower limit of detection are termed "censored". Censored data, when used in statistical procedures, can influence estimates of mean and variance, and therefore a replacement value that accurately reflects an estimate of the true mean is preferred. Furthermore, geochemical data are, by definition, compositions, and as such the issue of closure becomes important (Aitchison 1986). Egozcue et al. (2003) describe various transformations that assist in evaluating data that are constrained by the effect of closure. For censored geochemical data, replacement values can be determined using several methods based on maximum likelihood estimates of replacement values (Palarea-Albaladejo et al. 2014). Elements in which >80% of the values were censored were dropped from further evaluation, which included Ag, Cs and Te.

The data were also screened for sample sites where a large number of elements were reported at less than the lower limit of detection (<LLD). In the surface soil, 8 sites were found to have more than 25 elements reported at <LLD (3 from Florida). For the A horizon, 2 sites, both from Florida, were found to have more than 25 elements reported at <LLD. For the C horizon, 3 sample sites, in Florida, were found to have more than 25 elements reported at <LLD. These sites were dropped from further evaluation.
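
The two screening rules above (dropping elements that are censored in more than 80% of samples and dropping sites with more than 25 censored elements) and the percent-to-ppm conversion can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, censored values are assumed to be stored as `None`, and the per-site count is assumed to run over all analysed elements, none of which is specified in the text. The study itself was carried out in R.

```python
def percent_to_ppm(value_percent):
    """Convert a concentration in weight percent to ppm (factor of 10,000)."""
    return value_percent * 10000.0

def censored_fraction_by_element(samples, elements):
    """Fraction of samples in which each element is censored (stored as None)."""
    n = len(samples)
    return {el: sum(1 for s in samples if s[el] is None) / n for el in elements}

def screen(samples, elements, max_element_fraction=0.80, max_censored_per_site=25):
    """Drop heavily censored elements and sites with many censored elements."""
    fractions = censored_fraction_by_element(samples, elements)
    kept_elements = [el for el in elements if fractions[el] <= max_element_fraction]
    kept_samples = [
        s for s in samples
        # Assumption: the count uses all analysed elements, not only the kept ones.
        if sum(1 for el in elements if s[el] is None) <= max_censored_per_site
    ]
    return kept_elements, kept_samples
```

Each sample is represented here as a dictionary mapping element symbols to values, so `screen(samples, elements)` returns the retained element list and the retained samples under the thresholds quoted in the text.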

Summary statistics for the elements are provided by Smith et al. (2013, 2014). The remaining 43 elements: Al, As, Ba, Be, Bi, total C, Ca, Cd, Ce, Co, Cr, Cs, Cu, Fe, Ga, Hg, In, K, La, Li, Mg, Mn, Mo, Na, Nb, Ni, P, Pb, Rb, S, Sb, Sc, Se, Sn, Sr, Th, Ti, Tl, U, V, W, Y, Zn were then evaluated for the estimate of replacement values for those results that were reported at less than the lower limit of detection. The method of nearest neighbour replacement estimates (R package: zCompositions, function **lrEM**) was used on the censored data (Palarea-Albaladejo et al. 2014). The adjusted data were then used for subsequent multivariate statistical analysis.

### *17.2.3 Integration of Land Surface Parameters with Soil Geochemistry*

Land surface maps of the conterminous United States (Sayre et al. 2009) were used to test the effectiveness of the soil geochemistry for revealing information on surficial materials lithology, terrestrial ecosystems and isobioclimate. Isobioclimatic zones were subdivided into thermotypes (temperature) and ombrotypes (moisture). In this study, only the surface lithologies were studied in further detail. The results of the evaluation of the soil geochemistry in the context of terrestrial ecosystems, thermotypes and ombrotypes will be provided at a later time.

The maps were obtained as raster images with a pixel resolution of 1 km and a geodetic projection of decimal degrees using the North American Datum of 1983 (NAD83). These images were re-projected to the Lambert Conformal Conic projection using the following parameters (Spheroid—GRS 1980; Central Meridian: 96° West; Standard Parallels of 32° and 44°; Latitude of Origin: 38°; False Eastings and Northings of 0 m). This projection was used throughout the study.
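
To make the projection parameters concrete, the following sketch implements the standard forward equations of the two-standard-parallel Lambert Conformal Conic projection on the GRS 1980 ellipsoid, with the central meridian, standard parallels and latitude of origin listed above as defaults. The function name `lcc_forward` is hypothetical; in the study the reprojection was done with GDAL through QGIS, and a production workflow would normally rely on such a library rather than hand-written formulas.

```python
import math

# GRS 1980 ellipsoid: semi-major axis (m) and flattening.
SEMI_MAJOR = 6378137.0
FLATTENING = 1.0 / 298.257222101
ECC = math.sqrt(2.0 * FLATTENING - FLATTENING ** 2)  # first eccentricity

def _m(phi):
    return math.cos(phi) / math.sqrt(1.0 - (ECC * math.sin(phi)) ** 2)

def _t(phi):
    es = ECC * math.sin(phi)
    return math.tan(math.pi / 4.0 - phi / 2.0) / ((1.0 - es) / (1.0 + es)) ** (ECC / 2.0)

def lcc_forward(lat_deg, lon_deg, lat1=32.0, lat2=44.0, lat0=38.0, lon0=-96.0):
    """Project latitude/longitude (degrees) to LCC easting/northing (metres)."""
    phi, lam = math.radians(lat_deg), math.radians(lon_deg)
    phi1, phi2, phi0 = (math.radians(v) for v in (lat1, lat2, lat0))
    # Cone constant and scaling factor from the two standard parallels.
    n = math.log(_m(phi1) / _m(phi2)) / math.log(_t(phi1) / _t(phi2))
    big_f = _m(phi1) / (n * _t(phi1) ** n)
    rho = SEMI_MAJOR * big_f * _t(phi) ** n
    rho0 = SEMI_MAJOR * big_f * _t(phi0) ** n
    theta = n * (lam - math.radians(lon0))
    # False easting and northing are both 0 m.
    return rho * math.sin(theta), rho0 - rho * math.cos(theta)
```

With these defaults the projection origin (38° N, 96° W) maps to (0, 0), points east of the central meridian receive positive eastings, and points north of the latitude of origin receive positive northings.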

The Quantum Geographic Information System (QGIS) (QGIS Development Team 2016) was used for the integration of various data sources and the geospatial rendering of the results. Within QGIS, two procedures were used from the Geospatial Data Abstraction Library (GDAL): "**warp (reprojection)**" and "**point sampling tool**". The map images were initially re-projected to the Lambert Conformal Conic (**lcc**) projection listed above using the "**warp**" procedure. The point dataset of the geochemical sampling sites was also reprojected from latitude/longitude coordinates to the **lcc** projection. The **lcc** image of the surface lithology was then sampled at the geochemical site coordinates using the "**point sampling tool**" and the surface lithology value was integrated into the geochemical database. This methodology was carried out for the other land surface maps (terrestrial ecosystems, surface lithologies, thermotypes and ombrotypes). The values of these features were integrated into the soil geochemistry dataset for further evaluation. It should be noted that the maps produced by Sayre et al. (2009) are generalizations and are expressed at a resolution of 30 m (landforms, topographic moisture), 1 km (biogeographic regions) and 15 km for the surface lithology. It is possible that the class defined at any given point on the maps produced by Sayre et al. (2009) does not correspond with the surface lithology, biogeographic, landform or topographic classes that were encountered during the soil survey sampling program.

For geospatial rendering purposes (interpolation), the Level 1 Ecology map of the conterminous United States was used to create a grid with a cell size of 40 km × 40 km.

Interpolation of principal component scores, posterior probabilities and measures of typicality were carried out using a geostatistical framework. The gstat package (Pebesma 2004) was used to generate and model semi-variograms with sufficient parameters to generate interpolated images through kriging. The cell size used for image interpolation was chosen as 40 km, the approximate spacing of the site sampling locations.
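
As background to the semi-variogram modelling step, the sketch below computes a classical empirical semivariogram, that is, half the mean squared difference between pairs of values grouped by separation distance. It is a minimal, quadratic-time illustration with hypothetical names, and it assumes projected coordinates in consistent units; it is not the gstat implementation used in the study, and model fitting and kriging are not shown.

```python
import math

def empirical_semivariogram(coords, values, lag_width, n_lags):
    """Return (mean distance, semivariance, pair count) for each non-empty lag bin."""
    sq_diff_sums = [0.0] * n_lags
    dist_sums = [0.0] * n_lags
    counts = [0] * n_lags
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            h = math.dist(coords[i], coords[j])
            k = int(h // lag_width)  # index of the lag bin [k*w, (k+1)*w)
            if k < n_lags:
                sq_diff_sums[k] += (values[i] - values[j]) ** 2
                dist_sums[k] += h
                counts[k] += 1
    return [
        (dist_sums[k] / counts[k], sq_diff_sums[k] / (2.0 * counts[k]), counts[k])
        for k in range(n_lags)
        if counts[k] > 0
    ]
```

With sites roughly 40 km apart, a lag width of the same order would be a natural starting point, although the lag settings actually used are not stated in the text.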

### *17.2.4 Process Discovery—Empirical Investigation of Soil Geochemistry*

After screening the data for detection limit issues and missing values, the geochemical data were subjected to an empirical investigation in which the assumptions about the data are minimal. To deal with the effect of closure, the data for 41 elements (Al As Ba Be Bi Ca Cd Ce Co Cr Cu Fe Ga Hg In K La Li Mg Mn Mo Na Nb Ni P Pb Rb S Sb Sc Se Sn Sr Th Ti Tl U V W Y Zn) were log-centred transformed, after which a principal component analysis (PCA) was carried out using the methodology of Zhou et al. (1983) and Grunsky (2001). PCA was carried out on the entire set of multi-element data for the surface soil and the A and C horizons combined. PCA was also carried out on the multi-element data individually for the surface soil, A and C horizons. The rationale for this is based on enhancement of the multi-element signature for each layer rather than a principal component signature derived from the combined layers. The principal component biplots and corresponding maps of the component scores were subsequently generated for the surface soil and the A- and C-horizons independently. The biplots and interpolated maps provide insight into the orthogonal linear relationships that can reflect dominant geochemical processes that are influenced by mineral stoichiometry. The three soil layers were evaluated together in order to show any possible relationships between the two soil horizons (A and C) and the surface soil layer. To assist with insight into processes that influence the relationship of the elements and patterns of the scores of the observations, the loadings of the elements were coloured according to the classification of Goldschmidt (1937) into lithophile, siderophile or chalcophile affinity. Elements associated with the atmophile affinity were not considered in this study.
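
The log-centred (centred log-ratio) transform that precedes the PCA can be written in a few lines: each part is replaced by its logarithm minus the mean of the logarithms of all parts in the same sample. The sketch below assumes strictly positive values, which is why censored values are replaced beforehand; the function name `clr` is a common abbreviation rather than a name used in the chapter, and the PCA step itself is not shown.

```python
import math

def clr(composition):
    """Centred log-ratio transform of one strictly positive composition."""
    logs = [math.log(v) for v in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Hypothetical three-part sample in ppm, for illustration only.
transformed = clr([52000.0, 310.0, 18.0])
```

The transformed values of each sample sum to zero, and rescaling the whole sample (for example from ppm to percent) leaves them unchanged.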

### *17.2.5 Process Validation—Modelled Investigation of Soil Geochemistry*

Using the classified information derived from the land surface maps of Sayre et al. (2009), the geochemical data were used to establish the ability to predict these classifications using a cross-validation approach in which the data are repeatedly sub-sampled as part of the classification process.

Previous studies (Grunsky et al. 2012, 2014) demonstrated that multivariate statistical methods were able to classify bedrock lithologies based on lake sediment and glacial till geochemical data using discriminant analysis. The methodology employed the results of principal component analysis (described above), followed by an analysis of variance and the application of linear discriminant analysis (Venables and Ripley 2002) to determine which principal components were best at classifying and predicting the bedrock lithologies. This approach relies on having a sufficient number of degrees of freedom and homogeneity of covariance between the classes of the training sets. An alternative to linear discriminant analysis is quadratic discriminant analysis (Venables and Ripley 2002), which compensates for the classes where the condition of homogeneity of covariance cannot be met. The results of applying these methods include measures of posterior probability, in which each site is assigned a probability of belonging to each of the classes and the class with the highest posterior probability is assigned to that site. Posterior probabilities are also compositions, as the probabilities for all of the classes at each site must sum to 1.0.

Both methods were tested for discriminating between the surface lithologies in this study. However, a comparison of results between linear discriminant and quadratic discriminant analysis showed little difference in the results and some classes had to be omitted because of an insufficient number of training sites.

To overcome some of the problems of applying classification methods in previous studies, we employed the statistical method Random Forests (Breiman 2001), as employed by Harris and Grunsky (2015) and used as part of a remote predictive mapping strategy (Harris et al. 2008). The Random Forest method is based on the construction of classification trees (Venables and Ripley 2002, Chap. 9) in which nodes (splits in classes) are based on continuous variables from which a series of branches in the tree will correctly classify (categorical variables) all of the data. The Random Forest method "grows" many trees and each tree provides a classification. Each classification is termed a vote, and the forest assigns the class with the most votes. A useful description of the methodology is provided in Breiman and Cutler (2016). The function "**randomForest**", herein referred to as "RF", from the package **randomForest** (Breiman and Cutler 2016) was used for the analysis.
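
The voting step can be illustrated independently of any particular library: given the class predicted by each tree for one site, the forest's class is the most frequent label, and dividing the counts by the number of trees gives normalised votes. The sketch below is a minimal illustration with a hypothetical function name and made-up lithology labels; it is not the randomForest implementation used in the study.

```python
from collections import Counter

def forest_vote(tree_predictions):
    """Return (winning class, normalised votes) from one site's per-tree labels."""
    counts = Counter(tree_predictions)
    total = len(tree_predictions)
    normalised = {label: count / total for label, count in counts.items()}
    winner = counts.most_common(1)[0][0]
    return winner, normalised

# Hypothetical predictions from four trees for a single site.
winner, votes = forest_vote(["carbonate", "sandstone", "carbonate", "carbonate"])
```

Because the normalised votes for a site sum to 1, they share the compositional character noted above for posterior probabilities.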

For each tree that is created, a training set is drawn with replacement from the data, and approximately one-third of the cases are left out of the sample. These are known as the out-of-bag (oob) data and are used to obtain a running unbiased estimate of the classification error as trees are added to the forest. Variable importance is also determined from the out-of-bag data. For each tree, all of the data are applied to the tree and "proximities" are determined for each pair of cases. If two cases occur at the same terminal node, the proximity of that pair is increased by one. When all of the trees have been estimated, the proximities are normalized by dividing by the number of trees. These proximities can be used for replacing missing data, identifying outliers and creating lower dimensional views of the data. Each tree is constructed by bootstrapping the original sample population; the roughly one-third of the data left out of each bootstrap sample is not used in constructing that tree but is then classified by it. An unbiased estimate of the classification error is determined from the oob cases that were not classified correctly. Variable importance is determined by comparing oob classification results before and after random permutation of each of the variables. Another measure of variable importance is the Gini measure, which is based on the decrease in node impurity produced by the splits made on a given variable over all of the trees in the forest. Variables do not need to be pre-selected using techniques such as analysis of variance, as the RF procedure determines which variables are the best classifiers.
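
The "about one-third" figure follows from sampling with replacement: when n cases are drawn n times, the chance that a given case is never drawn is (1 − 1/n)^n, which approaches 1/e ≈ 0.37 for large n. The short simulation below, using hypothetical names and the number of survey sites only as a convenient sample size, illustrates this for a single bootstrap sample.

```python
import random

def bootstrap_split(n_cases, rng):
    """Return (in-bag indices, out-of-bag indices) for one bootstrap sample."""
    in_bag = [rng.randrange(n_cases) for _ in range(n_cases)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_cases) if i not in drawn]
    return in_bag, out_of_bag

rng = random.Random(42)
n_sites = 4857  # number of sampled sites quoted in the chapter
in_bag, out_of_bag = bootstrap_split(n_sites, rng)
oob_fraction = len(out_of_bag) / n_sites  # close to 1/e for large n
```

Each tree therefore has its own set of unseen cases, which is what allows the classification error to be estimated without a separate validation set.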

Maps of the normalized votes, which are equivalent to posterior probabilities, can be created using geostatistical methods such as kriging. However, since the posterior probabilities are compositions and sum to 1.0, these values must be logratio transformed, followed by subsequent co-kriging, and then back transformed for subsequent geographic rendering (Pawlowsky-Glahn and Egozcue 2015; Mueller and Grunsky 2016). Instead, maps of the posterior probabilities for each of the classes were created by posting the sample sites with points and colours. An alternative to this would be to consider the un-normalized (raw) votes as independent and carry out kriging on these estimations. The results of these interpolations are provided in the Supplementary Annex.
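As an illustration of the transform and back-transform step, the sketch below uses the centred logratio (clr) and its inverse in plain Python. This is only one possible choice: the cited studies may use a different logratio (for example an additive or isometric one, which avoid the zero-sum constraint of clr coordinates). The vote values are hypothetical, and zero votes would need to be replaced before taking logarithms.

```python
import math

def clr(parts):
    """Centred logratio transform of a strictly positive composition."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def clr_inverse(coords):
    """Back-transform clr coordinates to a composition that sums to 1."""
    exps = [math.exp(c) for c in coords]
    total = sum(exps)
    return [e / total for e in exps]

votes = [0.6, 0.3, 0.1]          # hypothetical normalized votes for three classes
coords = clr(votes)              # unconstrained values that could be interpolated
recovered = clr_inverse(coords)  # back on the simplex, summing to 1
```

Interpolating in logratio space and back-transforming keeps every mapped value positive and keeps the class probabilities summing to one at each location.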

### **17.3 Results**

### *17.3.1 Process Discovery—Principal Component Analysis*

A logcentred transform was applied to the adjusted data, after which a principal component analysis was carried out. Ordered plots of the eigenvalues in the form of screeplots (Jolliffe 2002) are shown in Fig. 17.2a–d for (a) all of the data, (b) Surface Soil only, (c) A-horizon only and (d) C-horizon only. Figure 17.2a–d displays two important inflection points, at PCs 3 and 9. The first three eigenvalues define the dominant structure in the data and the next five display lesser but still significant structure. This is also expressed numerically in Table 17.1, where the first 10 eigenvalues are listed along with the associated cumulative contribution to the structure in the data. As shown in the screeplots of Fig. 17.2, a comparison of the first four successive eigenvalues for the C-horizon, A-horizon and Surface Soil shows that they are slightly greater for the C-horizon. This implies that the linear combinations of the elements are stronger for the C-horizon than for the other two. Eigenvalues less than 1 are interpreted to represent under-sampled processes or random effects (noise).

**Fig. 17.2 a**—Screeplot of eigenvalues of the soil geochemistry for the combined Surface Soil (0–5) cm layer, the A- and C- horizons, from the application of a principal component analysis to logcentred transformed data. **b**—Screeplot of eigenvalues of the soil geochemistry for the Surface Soil (0–5) cm layer from the application of a principal component analysis to logcentred transformed data for the top layer only. **c**—Screeplot of eigenvalues of the soil geochemistry for the A-horizon from the application of a principal component analysis to logcentred transformed data for the A-horizon only. **d**—Screeplot of eigenvalues of the soil geochemistry for the C-horizon from the application of a principal component analysis to logcentred transformed data for the C-horizon only

The largest eigenvalues signify that the linear combinations of the elements for these components are significant and define "structure" in the data. This structure can be interpreted as the influence of stoichiometric control of mineralogy.
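The cumulative contribution reported alongside the eigenvalues is the running sum of the ordered eigenvalues divided by their total. A short Python sketch is given below; the eigenvalues are made-up numbers for illustration and are not the values of Table 17.1.

```python
def cumulative_contribution(eigenvalues):
    """Cumulative percentage of total variance for eigenvalues in descending order."""
    total = sum(eigenvalues)
    running = 0.0
    cumulative = []
    for ev in sorted(eigenvalues, reverse=True):
        running += ev
        cumulative.append(100.0 * running / total)
    return cumulative

eigenvalues = [4.0, 3.0, 2.0, 0.6, 0.4]   # hypothetical, not from Table 17.1
cum = cumulative_contribution(eigenvalues)
```

In this toy case the first three components carry 90% of the variation, and the two eigenvalues below 1 would be read as noise under the rule of thumb described above.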


**Table 17.1** Principal Component Analysis results for logcentred transformed soil geochemistry

### *17.3.2 PCA of the Combined Surface Soil, A-Horizon, C-Horizon*

Figures 17.3a and 17.4a show biplots (PC1-PC2 and PC2-PC3) of the principal component scores and loadings for the combined data from the surface soil, A- and C-horizons. Table 17.1 shows that the first three principal components for the combined data (All Layers) account for 50.6% of the overall variation in the data.

Figure 17.3a shows the mass of data points defined by two vertices: (1) Cr-V-Ni-Co-Fe-Sc-Mn-P-Zn; (2) Hg-In-Ti-Se-Mo-As-Sb-Sn-Bi (chalcophile) and a trend of element associations: Mg-Ca-Na-Sr-Ba-K-Be-Rb-Tl that are inversely associated with the vertex defined by (2) above. The chalcophile elements are grouped along the +PC1 axis. Siderophile elements are associated with the +PC2 axis and the lithophile elements are distributed around the ±PC1/−PC2 axes and the −PC1/+PC2 axes.

Figure 17.4a shows the three sets of data (Surface Layer, A- and C-horizon) combined onto a biplot of PC2–PC3. The PC scores along the PC2 axis define a contrast between mafic (+ scores) and felsic (−scores) source material. Siderophile (Fe, Co, Ni), lithophile (Cr, V, Sc, Ti) and chalcophile elements (Cu, In) are associated along the +PC2 axis and lithophile elements (Rb, K, Tl, Ba, Th, La, Be, Ce) are concentrated along the −PC2 axis.

**Fig. 17.3 a**—Biplot of principal components 1 and 2 for the soil geochemistry for the combined Surface Layer, A, and C horizon soil geochemical data based on a log centred transform. The colours and symbols represent the surface soil and the soil A and C horizons. **b**—Biplot of principal components 1 and 2 for the Surface Soil geochemistry data based on a log centred transform. **c**—Biplot of principal components 1 and 2 for the A-horizon soil geochemistry data based on a log centred transform. **d**—Biplot of principal components 1 and 2 for the C-horizon soil geochemistry data based on a log centred transform

An association of chalcophile elements (Cd, S, Sb, As, Hg, Pb) occurs along the +PC3 axis with a corresponding concentration of sample sites associated with the surface layer and A-horizon, most likely representing complexing with organic rich soils. PC scores for the C-horizon are concentrated along the ±PC2 axis, which may represent a range of source material from mineral soils that are low in organic material (−PC3) to soils that are rich in organic material or derived from shales/ weathered materials (+PC3).

**Fig. 17.4 a**—Biplot of principal components 2 and 3 for the soil geochemistry for the combined Surface Soil, A, and C horizon soil geochemical data based on a log centred transform. The colours and symbols represent the surface soil and the soil A and C horizons as shown in Fig. 17.3a. **b**—Biplot of principal components 2 and 3 for the top layer soil geochemistry data based on a log centred transform. **c**—Biplot of principal components 2 and 3 for the A-horizon soil geochemistry data based on a log centred transform. **d**—Biplot of principal components 2 and 3 for the C-horizon soil geochemistry data based on a log centred transform

### *17.3.3 PCA of the Surface Soil, A-Horizon, C-Horizon*

The biplots of Fig. 17.3a–c for all of the data, the surface soil data and the A-horizon data, show similar patterns in terms of the relationships of the elements with each other and the shape of the data cloud for the projection of the principal component scores onto the PC1 and PC2 axes. The biplots exhibit a range of lithophile loadings that define materials derived from mafic, feldspathic, carbonate and REE-enriched sources within the quadrants described previously. Similarly, the chalcophile element association is concentrated along the +PC1 axis for both Fig. 17.3b, c, likely representing weathered and organic-rich material, which adsorb chalcophile elements.

The biplot of Fig. 17.3d (C-horizon) displays a different pattern in comparison with Fig. 17.3a–c. The +PC1 axis shows an association of lithophile elements (Ca-Mg-Na-Sr-P) and chalcophile elements (S-Cd), possibly representing a mix of feldspathic and/or carbonate source material. Along the PC1 axis and on the +PC2 domain, there is a contrast between (Ca-Na-Mg-S-Ba-K) and (Th-Ce-U-La-Nb-Al-Li) that may distinguish a feldspathic/carbonate source environment from an environment with relative enrichment in heavy minerals.

Figure 17.4a shows a pattern and association of elements that displays a contrast of the C-horizon data with the surface soil and A-horizon data. Figure 17.4a shows a siderophile and mafic lithophile pattern of Cr-Ni-Cu-V-Co-Fe-Sc along the +PC2 axis. Along the −PC2 axis of Fig. 17.4a there is a lithophile association of Rb-K-Ti-Ba-Ce-La-Tl. The +PC3 axis in Fig. 17.4a shows a chalcophile/lithophile association of Cd-S-Sb-Ca-P-Se-Hg-As-Mo-Pb-Sr-Zn. This region of the plot is dominated by surface soil and A-horizon data, although some C-horizon data are also present. A similar pattern is observed in Fig. 17.4b, c, although the groups of the elements are at opposite ends of PC3 (a sign switch). In Fig. 17.4b, c, the grouping of Al-Ga-Nb-Y-Ce-La-Th-U, which represents feldspars, clays and heavy minerals, is transitional between the siderophile/lithophile elements (Fe-Sc-Co-Cr-Ni) and the lithophile elements (Rb-Tl-K-Ba). Figures 17.3d and 17.4d, representing the C-horizon data, show the chalcophile enrichment trend along the +PC3 axis and a siderophile/lithophile trend along the PC2 axis. Transitional along this PC2 trend is an association of Al-Ga, likely representing feldspars and clays.

### *17.3.4 Mapping the Components*

The first three principal components for the surface soil, the A- and the C-horizons were interpolated using the geostatistical package, gstat (Pebesma 2004). Experimental semi-variograms were generated followed by variogram model fitting with subsequent kriging. The images for the three principal components are shown in Figs. 17.5a–c, 17.6a–c and 17.7a–c.
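The first step of this workflow, the experimental semi-variogram, can be sketched in plain Python. This is not the gstat interface: the function name, the lag binning and the four-site transect are illustrative assumptions. It implements the classical estimator, in which the semivariance for a lag bin is half the mean squared difference between all pairs of values whose separation falls in that bin.

```python
import math

def experimental_semivariogram(coords, values, lag_width, n_lags):
    """Classical (Matheron) estimator binned by separation distance.

    Bin k collects pairs with distance in [k*lag_width, (k+1)*lag_width).
    Returns one semivariance per bin, or None for empty bins.
    """
    sums = [0.0] * n_lags
    counts = [0] * n_lags
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            h = math.dist(coords[i], coords[j])   # Euclidean separation
            k = int(h // lag_width)
            if k < n_lags:
                sums[k] += (values[i] - values[j]) ** 2
                counts[k] += 1
    return [s / (2 * c) if c else None for s, c in zip(sums, counts)]

# Hypothetical PC scores at four sites on a unit-spaced transect.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
scores = [1.0, 2.0, 4.0, 7.0]
gamma = experimental_semivariogram(coords, scores, lag_width=1.0, n_lags=4)
```

A variogram model (for example spherical or exponential) would then be fitted to these binned values before kriging; that fitting step is not shown here.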

#### **Principal Component 1**

Geospatially, these patterns are observed in Figs. 17.5, 17.6 and 17.7. Figure 17.5a–c shows interpolated images based on kriging of the first principal component for the surface soil, A- and C-horizons, respectively. The patterns observed in Fig. 17.5a and b are consistent with the patterns observed in Fig. 17.3b and c. The +PC1 axis in Fig. 17.3b and c shows relative enrichment of the previously identified chalcophile elements, and the −PC1 axis shows relative enrichment of the mafic lithophile and siderophile elements. In Fig. 17.5a and b, the positive scores of PC1 appear to correspond with the region in the southeast US and the negative scores of PC1

**Fig. 17.5 a**–**c** Map of kriged principal component 1 for the Surface Soil, A- and C-horizon data. Figures 17.4b–d provide the context for relative element enrichment/depletion associated with each of the layers

**Fig. 17.6 a**–**c** Map of kriged principal component 2 for the Surface Soil, A- and C-horizon data. Figures 17.4b–d provide the context for relative element enrichment/depletion associated with each of the layers

**Fig. 17.7 a**–**c** Map of kriged principal component 3 for the Surface Soil, A- and C-horizon data. Figures 17.4b–d provide the context for relative element enrichment/depletion associated with each of the layers

appear to occur in the northwest US and west of Lake Superior. All three figures show a pattern that coincides with the banks of the Mississippi River. Negative PC1 scores for the surface layer and A-horizon correspond to relative enrichment in Na-Sr-Al-Ca-Mg-K-Ba, elements associated with feldspars and/or carbonate source material.

The image of PC1 for the C-horizon data (Fig. 17.5c) shows a strong negative region in the southeast US that corresponds to the chalcophile group of elements along the negative portion of PC1 in the biplot of Fig. 17.3d. The positive portion of PC1 in Fig. 17.3d corresponds to the dominantly lithophile and siderophile groups of elements and is displayed as a large region throughout the US, with the exception of the southeast US. The same "corridor" pattern along the Mississippi River is observed in Fig. 17.5c for the C-horizon results and represents the same relative concentration of lithophile elements observed in the surface layer and A-horizon.

Figure 17.5c shows the kriged image for the first principal component derived from the C-horizon data. In this case, the negative scores are restricted to the eastern US and reflect the chalcophile and rare earth elements indicative of detrital heavy minerals corresponding to the region of quartz enrichment accompanied with weathered and detrital materials within the erosional and weathering domain of the eastern US. Positive PC1 scores reflect a lithophile association of Ca-Na-Sr-Cd-Mg-Ba-K-Mn (Fig. 17.3d) and suggest an environment that is likely dominated by Ca-Na-K-Ba-Sr feldspars and Mg-Ca bearing ferromagnesian minerals.

An important consideration in the interpretation of the biplots is the significance of the associations of the elements. An initial interpretation of the biplots of Fig. 17.3a–d was that the associations of the chalcophile groups indicated relative enrichment of these elements (Hg-Se-As-Sb-Sn-Bi-Pb-S-In), representing weathered materials along with the accumulation of detrital minerals within the erosional and weathering domain of the southeastern US. In fact, these elements do not reflect relative enrichment but rather relative depletion with respect to the other groups of elements, notably the siderophile and lithophile elements. Geospatially, the chalcophile association of these elements corresponds to a region of high quartz content in the soil (Smith et al. 2014), and the effect has been termed the "quartz dilution effect". This effect in the soil geochemistry and the subsequent multi-element associations would likely be significantly different had Si been included in the analysis. A test was carried out in which the Si content of the data was simulated as the difference between the potential total (1,000,000 ppm) and the summed content of the compositions. This simulated Si value was then included in the composition and a PCA was carried out. The first component identified the relative Si enrichment as occurring in the southeast US. The simulated value of Si was not included in this study because other elements should also be considered in a total composition, including oxygen and nitrogen.
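The simulation of Si described above amounts to a simple closure calculation. The Python sketch below shows the arithmetic for one sample; the function name and the element concentrations are hypothetical and are not taken from the survey data.

```python
def simulated_si_ppm(element_ppm, total_ppm=1_000_000.0):
    """Residual to the potential total, used here as a stand-in for Si."""
    remainder = total_ppm - sum(element_ppm.values())
    if remainder < 0:
        raise ValueError("measured elements exceed the assumed total")
    return remainder

# Hypothetical measured concentrations (ppm) for one sample.
sample = {"Al": 60_000.0, "Fe": 30_000.0, "Ca": 10_000.0}
si = simulated_si_ppm(sample)
```

As the text notes, such a residual also absorbs unmeasured constituents such as oxygen and nitrogen, so it is only a rough proxy for silicon.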

#### **Principal Component 2**

As shown in Fig. 17.3b, c, the multi-element signature of PC2 is nearly the same for the surface soil and A-horizon. The patterns in both figures show two trends, one with relative enrichment in Cr-Ni-Co-Cu-V-Fe-Sc (siderophile/lithophile + Cu-Zn) and the other with relative enrichment in Hg-Se-As-Sn-Sb-Pb-Bi-In-S (chalcophile). These two multi-element associations reflect the chemistry of mafic minerals and elements that are associated with weathering and organic complexing. This is reflected in the maps of Fig. 17.6a, b, in which high PC2 values are noted in the eastern and southeastern US and the western US. The negative PC scores for the surface soil and A-horizon show relative enrichment in Rb-K-Tl-Ba-Be-Na-Sr-Al-Ga and, as shown in Fig. 17.6a, b, are geospatially concentrated in the central US, corresponding to the location of the Sand Hills of Nebraska (∼105° W/42° N), which consist of sand-sized particles of quartz and feldspar (Smith et al. 2014). There are also areas of negative PC2 scores, most likely representing feldspars associated with granitoid rocks, in southern Nevada, California, Arizona, Texas, New Hampshire and Maine (Smith et al. 2014).

The map of PC2 (Fig. 17.6c) for the C-horizon data shows positive scores associated with the mafic volcanic rocks of the northwest US, corresponding to the relative enrichment of siderophile (Fe-Ni-Co), lithophile (Cr-V-Sc) and chalcophile (Cu-Zn) elements as shown in Figs. 17.3d and 17.4d. The negative scores for PC2 show a similar pattern to those of the surface soil and A-horizon: relative enrichment in alkali lithophile elements (Rb-K-Ba-Be-Na-Sr) with Al-Ga representing feldspars, and REE lithophile elements (U-Th-La-Ce-Nb-Tl) that represent heavy minerals and quartz (as explained previously). The geochemical expression of these minerals in PC2, which are resistant to weathering, is reflected in both horizons and the surface soil.

#### **Principal Component 3**

The positive scores for the PC3 show relative enrichment of siderophile, mafic lithophile, and light REE elements for both the surface soil and A-horizon; whereas this pattern is represented by negative scores for the C-horizon. As shown in Fig. 17.4b–d, for all three layers, there is a continual transition from relative enrichment in alkali lithophile and REE elements, including Al and Ga, representing feldspars and minerals associated with felsic domains to relative enrichment in Cr-Ni-V-Cu-C-Fe-Sc-Ti-In-Zn that represents minerals associated with mafic domains. Figures 17.7a–c show the kriged images for the third principal component. The negative scores show relative enrichment of Cd-S-Ca-Sr-Sb-P-As, which may reflect the processes of organic complexing and sulphates. Negative scores noted in Utah, Nevada, west Texas, the Mississippi delta and south Florida may have a greater component of S. Negative scores that occur in Minnesota, Michigan, Indiana and the coast of New England may reflect the presence of shales, clays and organic accumulations. The negative PC3 scores of Fig. 17.4b exhibit a bimodal pattern of relative enrichment of Fe-Sc-In-Ti and Ga-Al-Y-Nb-Ce-La. The Fe-rich pattern is associated with the mafic volcanic rocks in the northwest and southwest US and the Ga-rich pattern occurs in the eastern US and reflects the presence of feldspars in the weathering of granitoid rocks in the southern Appalachians.

As seen in Fig. 17.4c, and nearly identical to that of the surface soil, the positive scores of PC3 exhibit a bimodal pattern for the A-horizon and indicate relative enrichment of Ti-Sc-Fe-In-V and Ga-Al-Th-La-Nb-Ce. These two groups reflect both a mafic and a feldspathic/heavy mineral rich environment. Figure 17.7b shows the mafic association (Ti-Sc-Fe-In-V) in the northwest US. The positive scores in the eastern, southern and, in particular, the southeast US reflect elements associated with feldspars and heavy minerals; this indicates the concentration of minerals through the weathering process, which may be due more to gravitational effects than to chemical breakdown. As in Fig. 17.7a, the negative scores of PC3 in the A-horizon demonstrate the same patterns and processes.

The C-horizon map shows two distinct geospatial patterns. The positive scores of Fig. 17.4d show relative enrichment in the chalcophile group, Sb-As-S-Mo-Se-B-Cd-Hg-U-Li-W and occur primarily in the southeast US. This pattern likely reflects both the quartz dilution effect and the presence of chalcophile elements relative to other areas throughout the US. The negative scores, which show relative enrichment of the lithophile elements Al-Ga-Na-Y-K-Be-Ba-Mn-Ti-Fe-Sc-Co, reflect a combination of mafic minerals and feldspars. These patterns are observed in the western US, Minnesota-Wisconsin, central Appalachia and the northeast US. Patterns associated with the elements that reflect mafic domains are the northwest US and Wisconsin-Minnesota. Patterns that reflect the feldspathic domains are Nebraska-Colorado, central Appalachia and the northeast US.

Evaluation of the soil geochemistry for the surface soil, the soil A horizon and the soil C horizon using a principal component approach reveals that there are continental-scale geochemical patterns that appear to be associated with the composition of the underlying soil parent material, climate, and weathering. At the scale of evaluation, details on specific lithologies are difficult to resolve, but the patterns are consistent with those mineralogical patterns delineated by Smith et al. (2014).

#### *Process Validation Predictive Mapping of Surface Lithologies*

The lithology of surficial materials by Sayre et al. (2009) is represented by 18 classes plus unknowns and listed in Table 17.2. A total of 17 classes were selected for further study. The classes "unknown" and "water" were not used as they were not considered suitable for classification.

Figure 17.8 shows a map of the sampling sites with the surface materials lithology from Sayre et al. (2009). The patterns of surface materials on the map show some similarities with the patterns observed from the first three principal components for the surface soil, A- and C-horizons. Figure 17.9 shows biplots of the first two principal components that are coded according to the surface lithologies. The pattern of the mafic lithophile elements (Cr-Ni-Cu-V-Co-Fe-Sc) in Fig. 17.9a, b is dominated by silica-rich residual soils (SilRes), whereas the chalcophile enrichment pattern (Hg-Se-Mo-Sn-Bi-Pb-Sb-As-Ti-S-In) appears to be associated mostly with alluvium (Alluv) and coastal zone sediments (CZS). The lithophile element grouping in the negative portion of PC2 shows a mix of several lithologies. The results of the PCA suggest that the linear combinations of elements are related to the patterns observed in the Surface Materials Lithologies of Fig. 17.8.


**Table 17.2** List of surface lithologies across the conterminous United States

a Not Used

From the application of the random forest classification, the Gini Index values (significance of the variables) for the surface soil, A- and C-horizons are listed in Table 17.3 and shown graphically in Fig. 17.10. The significance is expressed by the Gini Index, which is a measure of purity based on the success of a variable in distinguishing between classes. Table 17.3 shows that, generally, PCs 4, 5, 1, 2, 3 and 6 are the best variables for classification of the surface lithologies for the surface soil, A- and C-horizons. Maps of the normalized votes in point form and interpolated (kriged) maps of the raw votes are shown in the Supplementary Annex (Supplementary Figs. 1–15).
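To make the idea of purity concrete, the following Python sketch computes the Gini impurity of a set of class labels and the decrease in impurity achieved by one split. In a random forest, a variable's Gini importance accumulates such decreases over every split made on that variable. The class labels and the split shown are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
             + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

# Hypothetical node with two lithology classes, split perfectly by some PC score.
parent = ["Alluv"] * 4 + ["SilRes"] * 4
left, right = ["Alluv"] * 4, ["SilRes"] * 4
decrease = gini_decrease(parent, left, right)
```

A split that separates the classes completely, as here, removes all of the parent impurity; a split that leaves both children as mixed as the parent contributes nothing.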

**Fig. 17.8** Map of soil sample sites coded by the Surface Lithology classification. This map represents the actual classification based on the maps of Sayre et al. (2009). Colours used in this figure are the same colours used in Sayre's maps. See text for details on how the sites were selected

**Fig. 17.9 a**–**c** Principal component biplot of the surface layer (**a**), A-horizon (**b**) and C-horizon (**c**) scores that are coded and coloured according to the surface lithologies

Table 17.4 shows the accuracy of prediction for each of the surface lithologies based on the Random Forest out-of-bag classification methodology for each of the surface soil, A- and C-Horizons. The table has been ordered from the highest to the lowest prediction accuracies based on the surface soil. It is worth noting that the depth of soil has only a minor influence in the prediction accuracies, suggesting that the geochemical signature of the underlying material persists throughout the soil column. Non-carbonate residual soils (NCaRes) (∼74%), loam associated with glacial till (GTLoam) (66–72%), siliceous residual soils (SilRes) (48–56%), alluvium (Alluv) (∼50%) and coastal zone sediments (CZS) (45–48%) have the highest prediction accuracies, whereas the lowest accuracies are associated with hydric peat and muck (HyPM) (0%), alkalic intrusions (AlkInt) (0%), glacial lake sediments (GlLs)


**Table 17.3** List of variable importance for the surface layer, A- and C-horizons as determined from Random Forest classification of the principal component results applied to the clr-transformed data. Colours reflect the most significant PCs (red) to least significant PCs (blue)

**Fig. 17.10** Plot of the significance of the principal components used in the random forest classification based on the Gini Index for the Surface Layer, A- and C-horizons. See the text for a detailed explanation

**Table 17.4** Measures of ordered predictive accuracy for the surface lithologies for the surface layer, the A- and C-horizons based on a Random Forest classification of the principal component results applied to the clr-transformed data


(0–1%) and extrusive volcanics (ExtVR) (0–6%). The prediction accuracy is sensitive to the initial representation of each class in the dataset. This sensitivity is partly due to the masking and swamping effect that a large population of sites for one type of surface lithology has over another (e.g. Alluvium vs. Hydric Peat and Muck).

Supplementary Tables 2, 3 and 4 provide a complete summary of the prediction accuracies for the surface soil, A- and C-horizons, respectively. The diagonal of each upper table (Tables 2a, 3a, 4a) indicates how many sample sites were classified correctly. Each row of the off-diagonal elements indicates the misclassification of the sites for each of the classes. The lower tables in Tables 2b, 3b, 4b show the classification accuracies as expressed in percentages. The overall classification accuracy is shown at the bottom of each table. Scanning the columns of Tables 2a, 3a, and 4a reveals that many classes are confused with alluvium (Alluv), siliceous residual material (SilRes), loam derived from glacial till (GTLoam) and non-carbonate residual material (NCaRes). Alluvium and non-carbonate residual material appear to overlap with almost all of the classes. The overall prediction accuracies for the surface soil, A- and C-Horizons are 50%, 49% and 49%, respectively.
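The way per-class and overall accuracies are read from such a table can be sketched in Python. The orientation assumed here (rows are the mapped class, columns the predicted class) is an assumption about the supplementary tables, and the three-class matrix is invented for illustration.

```python
def accuracies(confusion):
    """Return (per-class accuracy in %, overall accuracy in %).

    confusion[i][j] is the number of sites of true class i predicted as class j.
    """
    per_class = []
    for i, row in enumerate(confusion):
        total = sum(row)
        per_class.append(100.0 * row[i] / total if total else 0.0)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    n_sites = sum(sum(row) for row in confusion)
    return per_class, 100.0 * correct / n_sites

# Hypothetical matrix: two abundant classes and one rare class.
confusion = [
    [50, 40, 10],
    [20, 70, 10],
    [5, 5, 0],
]
per_class, overall = accuracies(confusion)
```

In this toy example the rare third class is never predicted correctly, yet the overall accuracy is barely affected, which mirrors the swamping of low-abundance lithologies described in the text.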

The R package "**randomForest**" produces raw and normalized votes for each of the classes. Votes record the number of trees that assign a site to a given class. As described above, normalized votes are the equivalent of a posterior probability and are therefore compositions. Classes such as AlkInt, HyPM and other classes that have low abundance in the data create problems in the creation of the co-regionalization that is required for co-kriging. Examples of the spatial distribution of the normalized and raw votes are shown below. The Supplementary Annex provides predictive maps for all of the surface lithologies, based on the normalized votes, for the surface soil, A- and C-horizons. Predictive maps for AlkInt and HyPM are not shown because the normalized votes for these two surface lithologies were very low and do not show any geospatial patterns. The prediction accuracies for the three media from Table 17.4 are 49.9%, 49.4% and 48.6%, respectively. Supplementary Tables 2, 3 and 4 provide details on the overlap of predictions for each surface lithology. In most cases, overlap is associated with non-carbonate residual soils, glacial till derived loam and alluvium. These three classes have the broadest range of compositional variation and occupy a significant amount of area across the conterminous US.

Figure 17.11 shows a map of normalized votes of Non-carbonate residual soils (NCaRes) derived from the random forest classification. Normalized votes >0.3 occur throughout the Midwest states from the Canadian border in the north to the Gulf of Mexico in the south. From Table 17.4, the overall classification accuracy is approximately 75% for the surface soil and the two soil horizons. Supplementary Tables 2, 3 and 4 show that compositional overlap occurs primarily with alluvium, which is also shown in the maps of Fig. 17.11 where a large number of sample sites show low normalized votes (∼0.2–0.3). Supplementary Fig. 13a, b show the normalized and raw vote maps of the NCaRes prediction.

Figure 17.12 shows a map of normalized votes for loam derived from Glacial Till (GTLoam). The overall classification accuracy ranges from 65.7 to 71.6% over the three soil layers. Supplementary Tables 2, 3 and 4 show that the overlap of the GTLoam composition is associated with non-carbonate residual material (NCaRes) and alluvium (Alluv) for the surface soil, A- and C-horizons. The pattern of elevated normalized votes coincides with the region described by Sayre et al. (2009) that is located in the north central US and south of the Great Lakes. The pattern of elevated GTLoam follows the course of the Mississippi River, which highlights the erosional path of this material. Supplementary Figs. 12a, b show the normalized and raw vote maps of the GTLoam prediction.

Normalized votes for the prediction of alluvium (Alluv) are shown in Fig. 17.13 (Supplementary Fig. 1). The overall prediction accuracy is ∼50% (Table 17.4) and compositional overlap is observed with the surface lithology non-carbonate residual soil (NCaRes) (Supplementary Tables 2, 3, 4). High predictions of alluvium are located in Nevada, western Texas and the southeast US states. The dispersed prediction of 0.2–0.3 represents the regions of compositional overlap with NCaRes, which can be seen on the map of Fig. 17.8. Supplementary Figs. 1a, b show the normalized and raw vote maps of the Alluv prediction and Supplementary Figs. 13a, b show the normalized and raw votes of the NCaRes prediction.

Figure 17.14 shows prediction based on the normalized votes for the Eolian Dunes (EolDune) of Nebraska, southward into Texas. The patterns are the same for the surface soil, A- and C-horizon maps. The highest values of normalized votes

**Fig. 17.11** Map of normalized votes for the surface lithology class, non-calcium residual soil (NCaRes). Sites with a normalized vote of less than 0.2 are omitted

**Fig. 17.12** Map of normalized votes for the surface lithology class, loam derived from glacial till (GTLoam). Sites with a normalized vote of less than 0.2 are omitted

**Fig. 17.13** Map of normalized votes for the surface lithology class, alluvium (Alluv). Sites with a normalized vote of less than 0.2 are omitted

**Fig. 17.14** Map of normalized votes for the surface lithology class, eolian dunes (EolDune). Sites with a normalized vote of less than 0.2 are omitted

occur in Nebraska and west-central Texas. The map of Sayre et al. (2009) shows EolDune in northern Texas and the Oklahoma Panhandle, although these two regions are not predicted in the surface soil, A- or C-Horizon results. Table 17.4 shows predictive accuracies of 22.3, 22.4 and 16.5% for the surface soil, A- and C-horizons, respectively. Supplementary Tables 2, 3 and 4 show that compositional overlap occurs with alluvium (Alluv) and non-carbonate residual soil (NCaRes). Supplementary Figs. 5a, b show the normalized and raw vote maps of the EolDune prediction.

The effects of erosion and subsequent re-deposition along the banks of the Mississippi River are observed for several of the surficial lithologies. NCaRes, CaRes and Colluv exhibit an erosional pattern along the Mississippi River, while EolLoess, GlLS, GlOut and GTLoam exhibit depositional patterns. This suggests that the recent deposition of sediments along the banks of the Mississippi River has modified the composition of the upper layers of the soil. These classes (EolLoess, GlLS, GlOut, GTLoam; Supplementary Figs. 6a, b, 8a, b, 9a, b, 12a, b) show a distinct compositional presence down the length of the Mississippi River, starting from the northern Midwest states, reflecting continued transport of these materials at a continental scale.

The maps for the surface soil, A- and C-horizon data that are displayed in the Supplementary Annex are briefly described in the section Supplementary Material.

### **17.4 Discussion**

Examination of the principal component biplots (Figs. 17.3 and 17.4) shows that the multi-element patterns are very similar for the surface soil and A-horizon data. The C-horizon biplots show similar multi-element groupings, but the shapes of the point patterns (Figs. 17.3d and 17.4d) are different from those of the surface soil and A-horizon (Figs. 17.3b, c and 17.4b, c). As described previously, the element groupings for the three sampling layers are:


These associations are slight variants on Goldschmidt's classification of elements: lithophile (Group 1), siderophile (Group 2) and chalcophile (Group 3).

The principal component biplots, along with the maps of the dominant principal components (Figs. 17.5, 17.6 and 17.7), indicate that there is strong stoichiometric and geospatial control on the patterns that are observed. These patterns, both in the biplots and the kriged map images, provide the justification to use the soil geochemical data to predictively map (validate) the surface lithology classification of Sayre et al. (2009). It should be noted that Sayre's map of surface lithologies does not distinguish lithologies with different mineralogies, and, hence there is considerable overlap between some of the classes defined by Sayre.

The results of the random forest classification show that, for most of the surface lithology classes, the accuracy of prediction and spatial coherence of the predicted sites is variable, as shown in Table 17.4, Figs. 17.11, 17.12, 17.13 and 17.14, and the Supplementary Tables and Figures. The surface lithologies with the lowest predictions are: Hydric Peat and Muck (HyPM), Alkalic Intrusives (AlkInt), Glacial Lake Sediments (GlLS), Extrusive Volcanic Rocks (ExtVR) and Saline Lake Sediments (SalLS). Two factors influence the classification accuracy. The first is the areal extent that a given class occupies. The compositional range of a class of small spatial extent may be swamped or masked by the compositional range of a class that is geographically adjacent to it and has a much larger areal extent. Surface lithologies such as AlkInt, HyPM, ExtVR, SalLS and GlLS have limited geospatial extent and the compositions of these lithologies are similar to several other lithologies, including Alluv, GTLoam and NCaRes. The second factor that influences the prediction accuracy is the common compositions of several of the surface lithology classes, namely alluvium (Alluv), non-carbonate residual soil (NCaRes) and silica-rich residual soil (SilRes). These surface lithologies are comprised of similar mineralogies and are, therefore, compositionally similar, which results in compositional overlap in the statistically based prediction process.

Silicate mineralogy, including quartz, is under-represented in the data used for this study. As discussed previously, the quartz dilution effect has an influence on how the various relationships of the elements are observed, particularly in the methods that are part of the "Process Discovery" component of this study. The absence of silicon in the geochemical analysis in terms of the classifications may have some effect on the ability to distinguish between the different surface lithologies, but the exact effect is unknown at this time and further studies where Si is included and subsequently excluded in process discovery studies are warranted.

The validation of surface lithologies using soil geochemistry highlights some of the limitations on predicting distinct surface lithologies that have similar geochemical compositions but represent different processes. Despite this confusion of compositions between surface lithology classes, the predictive maps render a close representation of the maps of Sayre et al. (2009).

### **17.5 Concluding Remarks**

The multi-element soil geochemistry over the conterminous United States contains a rich set of information that reflects the original source material and subsequent modification through weathering, mass transport, climate and biological activities. As a result, continental-scale geochemistry may represent many processes. In this study, we have focused on the evaluation and interpretation of the multi-element soil geochemistry from the surface soil, A- and C-horizons in the context of predicting the surface lithologies.

Process discovery makes use of multivariate methods such as principal component analysis, which creates orthogonal linear combinations of the elements that often reflect processes controlled by the stoichiometry of the minerals that comprise the parent material. This parent material may be bedrock (igneous, metamorphic, sedimentary), glacial deposits, loess or fluvial deposits. Ideally, soil geochemistry can be used to predict the composition of the underlying soil parent material. As demonstrated in this study, multivariate methods such as principal component analysis cannot decouple all of these processes. Processes such as igneous and metamorphic mineral reactions share similar mineral stoichiometry, making them indistinguishable from a geochemical perspective. Many distinct sedimentary assemblages are composed of similar lithologies with similar mineralogy, and are thus difficult to distinguish solely on a geochemical basis.

With the exception of the surface lithology map of Sayre et al. (2009), a continental-scale map of lithology does not exist, which creates difficulty in an attempt to predictively map at large scales. However, the availability of the maps by Sayre et al. (2009) that include terrestrial ecosystems, thermoclimate, soil moisture and surface lithologies provides an opportunity to test the capacity of soil geochemistry to uniquely define these features. Although not presented here, the soil geochemistry has the ability to uniquely define terrestrial ecosystems and regional climate indicators. We intend to publish the results of using soil geochemistry to uniquely identify the terrestrial ecosystems, thermoclimatic zones and soil moisture (ombrotype) as defined by Sayre et al. (2009).

With few exceptions, there are only minor differences between the geochemical compositions of the surface soil and the A-horizon. The geochemistry of the C-horizon is distinct from that of the surface soil and A-horizon, as it has not undergone the same degree of weathering as the near-surface soils and contains less organic material.

The overall accuracies for predicting the surface lithologies from the surface soil, A- and C-horizons are 49.9%, 49.4% and 48.6%, respectively. As described above, these low accuracies are due to the overlap of many of the lithologies with Alluvium, Non-carbonate residual soils, Siliceous soils, Eolian Dunes, Eolian Loess and materials deposited from glaciation. However, the spatial continuity of the posterior probabilities confirms the distinctiveness of these lithologies and demonstrates the effectiveness of soil geochemistry in recognizing the differences between the classes.

The geochemistry of soils represents modification of the initial parent material through weathering in response to varying precipitation and temperature, groundwater effects, meteoric water effects, biologic activity and geologic complexity. Thus, geochemistry is a rich source of information that can be used in many ways to describe, monitor and predict processes derived from natural and anthropogenic events (Grunsky et al. 2013).

The results from the statistical evaluation of the geochemical data in the context of predicting surface lithologies across the conterminous US indicate that soil geochemistry reflects a number of physical processes. Further studies of the soil geochemistry across the US will evaluate the ability to predict terrestrial ecosystems and indicators of climate.

**Acknowledgements** The authors thank Karl Ellefson of the United States Geological Survey for his thoughtful and helpful review of the manuscript.

### **References**



## **Part III Exploration and Resource Estimation**

### **Chapter 18 Quantifying the Impacts of Uncertainty**

**Peter Dowd**

**Abstract** This chapter reviews the general concepts of uncertainty and probabilistic risk analysis with a focus on the sources of epistemic and aleatory uncertainty in natural resource and environmental applications together with examples of quantifying both types of uncertainty. The initial uncertainty in these applications arises from the in-situ spatial variability of variables and the relatively sparse data available to model this variability. Subsequent uncertainty arises from processes applied either to extract the in-situ variables or to subject them to some form of flow and/or transport. Various approaches to quantifying the impacts of these uncertainties are reviewed and several practical mining and environmental examples are given.

### **18.1 Introduction**

This chapter provides an overview of the quantification of uncertainty with a focus on mineral and energy resources and environmental applications, drawing on the work of the author and his co-authors over the past 30 years. Rarely in mining applications do initial estimates reconcile with production; there is almost always some reverse calibration or model revision to achieve an operationally acceptable agreement. This feedback approach can be a useful means of model calibration but the production 'reality' is an outcome conditional on the model and data used to make the production decision and may be biased. The resort to post hoc empirical calibration is due partly to insufficient data and partly to inadequate accounting for all sources of uncertainty. This situation will worsen as, increasingly, mineral resources will be extracted from deeper and/or lower grade deposits, which will require new technologies and new types of indirect sampling. In applications such as hydrocarbon extraction, the feedback reconciliation approach is essential because the in-situ variables can never be directly observed; Caers (2011) gives a comprehensive account of uncertainty quantification for these types of application.

P. Dowd FREng, FTSE (✉)

The University of Adelaide, Adelaide, Australia e-mail: peter.dowd@adelaide.edu.au

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_18

The focus here is on geological applications in which the purpose is to extract material, store material or monitor the flow of fluids or contaminants. In these applications, uncertainty arises from two sources of variability: the in-situ variability of the geology and associated quantitative variables and the variability that is generated by applying processes to the in-situ resource. The basic approach is to combine data with a model to make predictions. Such predictions are meaningless unless accompanied by quantitative measures of the uncertainty of the prediction.

The general focus, particularly in mining applications, has been on the uncertainty arising from sparse data and not on uncertainty arising from the model, even though the model is inferred, and its parameters are estimated, from the sparse data. Variability arising from processes applied to the in-situ resource is either quantified in an overly simplistic manner or is ignored. The additional aspect in these and most spatial applications is that variability (and, therefore, uncertainty) is scale-dependent and may be relevant on multiple scales depending on the application.

### **18.2 Sources of In-Situ Uncertainty**

In the field of uncertainty and probabilistic risk analysis two types of uncertainty are identified: aleatory and epistemic uncertainty (or irreducible and reducible uncertainty). In the generally accepted definitions (e.g., Bedford and Cooke 2001), aleatory uncertainty arises from the inherent variability of a phenomenon and cannot be reduced; epistemic uncertainty arises from incomplete knowledge of the phenomenon and can be reduced by more data, analysis or research. As both types of uncertainty are expressed in terms of probabilities, some authors question the necessity to distinguish between them. Others (e.g. Hora 1996; Winkler 1996) prefer to speak of sources of uncertainty rather than types: "the distinction between uncertainties is a matter of choice of scale and is, therefore, mutable." In the geostatistics context, Matheron (1975, 1976, 1978) notes that the empirical basis of uncertainty is the same in both cases and there is no objective criterion to distinguish them. Journel (1994) gives guidelines for modelling uncertainty on which Srivastava (1994) provides critical comment. However, as Winkler (1996) noted, "uncertainty is uncertainty but the distinctions are related to very important practical aspects of modelling and obtaining information". This is especially so in the applications given here.

A fundamental difference between geological applications and many others is that each occurrence (orebody, karst system) is unique and, apart from measurement error, once a physical sample is taken at a location and the required variable is measured directly from the sample, there is no longer any uncertainty about the value of the variable at that location. The general geostatistical model includes stationarity, which allows for repeated sampling of the same random variable at different locations. In principle (but not in practice), all locations in an orebody could be sampled and aleatory uncertainty would be eliminated. Thus, in these applications aleatory uncertainty is entirely a function of the amount and quality of data. Epistemic uncertainty arises from the assumed or inferred geological model (e.g., type, or style, of mineralisation). In mining applications, at least in terms of a general model, there may be significant epistemic uncertainty during early stages of proving a deposit when geological models are inferred from sparse data. Model uncertainty may persist in later stages in terms of the specific characteristics or parameters of the model.

In some natural resource applications, the variables that define the resource can never be directly observed. For example, in hot dry rock (HDR) enhanced geothermal systems, the variable of interest is the combination of natural and stimulated fractures that form connected networks to extract heat. These fractures, at depths of up to 4.5 km, can never be directly observed or measured; their locations, extents and characteristics can only be inferred from micro-seismic events generated by fracture movement, stimulation and propagation (e.g., Xu and Dowd 2014). In these applications, the detailed model can never be known irrespective of the amount of data available. As mineral resources are extracted from increasingly deeper deposits there will be a move from physical samples, from which variables are directly measured, to sensed proxy variables and a move from traditional mining methods to in-situ recovery. For indirectly sensed variables, the aleatory uncertainty of the required variable (e.g., porosity) is largely due to the quality of the relationship with the directly sensed proxy variable (e.g., acoustic impedance), which could be classified as measurement, or interpretation, error.

Thus, although both sources of in-situ uncertainty in these applications are functions of the amount of data, it is useful to distinguish between them in quantifying uncertainty. Hereafter, epistemic uncertainty is used to refer to uncertainty in conceptual or descriptive geological models as well as in quantitative parametric models that describe spatial variability and in which parameter values are calculated or inferred from data.

Although epistemic uncertainty is recognised, it is largely ignored in practice. Once a model is assumed or inferred and/or its parameters are inferred or estimated from the available data, all measures of uncertainty are based on the data; in most applications, the model of spatial variability is implicitly assumed to be known with certainty. In other fields, there has been a longstanding recognition of the importance of identifying and quantifying both sources of uncertainty and of propagating them into a complete systems model (e.g., Bedford and Cooke 2001; Helton et al. 2004; Oberkampf et al. 2002, 2004). In natural resource applications, particularly mining, the emphasis has largely been on aleatory uncertainty with implicit acceptance that epistemic uncertainty is negligible. Geostatistical simulation is widely used to quantify the effects of limited data on resource modelling and estimation (aleatory uncertainty) but the model (e.g., variogram, spatial pattern) is generally assumed to be perfectly known (no, or negligible, epistemic uncertainty).

### **18.3 Transfer Uncertainty**

A further complication in mineral and energy resources is that there are additional significant sources of uncertainty in extraction and processing to produce a final product. To borrow a petroleum industry term these might be called transfer, or process, functions and the associated uncertainties, transfer or process uncertainty. A general approach to integrating this source of uncertainty is to quantify all sources of in-situ uncertainties and propagate them into simulated transfer processes (e.g., blasting, selective loading, transport, mineral processing).

In resource extraction applications, it is useful to distinguish two broad types of process (or transfer) uncertainty:


### **18.4 Consequences of In-Situ Uncertainty**

There are broadly two aspects of a geological model used in mineral resource applications: the generic type (e.g., stratiform silver/lead/zinc orebody) and the unique aspects that distinguish a specific orebody within the type (e.g., faulting, folding, degree of spatial continuity and of regularity of orebody boundaries). In general, for mineral deposits the first of these is known with near certainty at a relatively early stage but the distinguishing aspects and the relevant scales on which these aspects occur may not be known until much later. In these applications, the two types of in-situ uncertainty are not independent. The sampling scale (e.g., drilling grid) is determined, or at least significantly informed, by the geological model; the sampling scale determines the data, the spatial variability of which is the aleatory uncertainty; the parameters of the model are estimated from the data.

The Stekenjokk mine in Sweden provides a striking example of the consequences of epistemic uncertainty. Boliden Mineral AB mined this massive copper-zinc-silver orebody from 1976 to 1988 and processed a total of 8 M tonnes of ore. Prior to mine development the drilling grid was 20 m × 20 m and, in places, 20 m × 10 m. Figure 18.1 is an idealised, but typical, vertical cross-section through the orebody showing the drill-hole intersections with the ore. Drilling data were combined with the assumed geological model to generate the estimated orebody boundaries. Figure 18.1 shows the complex, multi-directional folding of ore zones encountered in mining. The practical consequences of these predictions were significant (Hoppe 1978):


In principle, the problem could have been resolved by more appropriate sampling but the "appropriateness" of sampling was determined by the assumed geological model. In addition, sampling is constrained by cost (relative to the value of the mined product) and the cost of a drilling grid capable of capturing the folding may well have been prohibitive.

Geological models are only as good as the quality and interpretation of the data and the appropriateness of the scale on which the data are collected. Stekenjokk is an extreme (but not unique) example of epistemic uncertainty that could only be reduced to an acceptable level by more data. However, this observation is somewhat circular: the geological model depends on the amount of data/information available but the data type and collection are informed by the assumed model.

**Fig. 18.1** Interpolation of ore continuity from surface drilling data prior to mine development; adapted from Hoppe (1978)

### *18.4.1 Scale and Variability Example: Hilton Orebodies Australia*

This example is from a study of a complex group of three silver/lead/zinc orebodies at what, at the time, was known as the Hilton mine in north-western Queensland, Australia. The full study is given in Dowd and Scott (1984) with a later study in Dowd et al. (1989).

The Hilton orebodies are 22 km north of Mt Isa, one of the world's largest stratiform base metal deposits. The Hilton orebodies have a similar diagenesis to the Mt Isa orebodies with mineralisation occurring in the same dolomitic shale. The study was undertaken at the pre-feasibility stage and all original drilling, sampling and interpretation were influenced by 50 years' mining experience at Mt Isa. Although the Mt Isa and Hilton styles of mineralisation are similar, the Hilton orebodies are structurally more complex and less continuous.

Two test areas were extensively drilled to provide detailed information for a geostatistical study to determine optimal drilling densities for mine planning purposes. The holes were drilled from access drives as fans on cross-sections spaced 10 and 20 m apart. One such cross-section is shown in Fig. 18.2 in which the holes intersect the main 2 orebody footwall lens (2 O/B FW) at approximately 5 m centres. The dark blue outlines in Fig. 18.2 are the orebody boundaries estimated from the drill-hole data on the cross-section and on the cross-sections on either side. In the feasibility stage cost would prohibit such a drilling density over the entire orebody. Given the density of the drilling these estimated boundaries could be regarded as reality on all practical scales.

The effects of other drilling densities were assessed by removing drill data to create new datasets; e.g., removing every second drill-hole on a cross-section yields a 10 m spacing. Datasets for 5, 10, 20 and 40 m drill spacing were used in the study. Orebody boundaries were estimated for each drilling density and the results were given to mining engineers to design stopes. As an example, the estimated orebody boundaries for 20 m drill spacing are shown in Fig. 18.3. As expected, these boundaries are much smoother (less variable, more continuous) than the "reality" represented by the boundaries estimated from the 5 m spacing dataset. The variability of the boundaries is critical in the choice of mining method: the variability of boundaries and their exact delineation are less critical if a bulk mining method is adopted than if more selective methods are used. The original mining method was cut and fill followed later by sub-level open stoping and bench mining.

**Fig. 18.2** Cross-sectional interpretation based on 5 m drill spacing

**Fig. 18.3** Cross-sectional interpretation based on 20 m drill spacing

**Fig. 18.4** Overlay of 5 m interpolation on 20 m interpolation

**Fig. 18.5** Overlay of 20 m interpolation on 5 m interpolation. Based on 20 m model, all visible dark blue areas represent ore loss

Figure 18.4 shows the 5 m interpolation overlaid on the 20 m interpolation. Taking the 5 m interpolated boundaries as reality, all visible light blue areas represent ore dilution arising from planning and extraction based on the 20 m interpolated boundaries.

Figure 18.5 shows the 20 m interpolation overlaid on the 5 m interpolation. Again, taking the 5 m boundaries as reality, all visible dark blue areas represent the ore loss arising from planning and extraction based on the 20 m interpolated boundaries. Of course, the perfect selection and the adherence to estimated boundaries during production implied by this exercise are not entirely realistic. However, the impact on the choice of mining method, on the predicted grades and tonnages, and on economic outcomes is real.
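The dilution and ore-loss areas described above are set differences between the "real" and the planned ore outlines. A minimal sketch, assuming both outlines have been rasterised onto the same grid as boolean masks (the masks and the function name are hypothetical and are not taken from the Hilton study):

```python
import numpy as np

def dilution_and_loss(true_ore: np.ndarray, planned_ore: np.ndarray):
    """Return (dilution, loss) cell counts for two boolean ore masks on the same grid."""
    dilution = np.count_nonzero(planned_ore & ~true_ore)  # planned as ore, actually waste
    loss = np.count_nonzero(true_ore & ~planned_ore)      # actual ore left outside the plan
    return dilution, loss

# Hypothetical 1 x 8 section: the 5 m "reality" is irregular, the 20 m outline is smoother
true_ore = np.array([0, 1, 1, 0, 1, 1, 1, 0], dtype=bool)
planned_ore = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=bool)
print(dilution_and_loss(true_ore, planned_ore))  # (1, 2)
```

Multiplying the cell counts by the cell area, section spacing and density would convert them to tonnages; those factors are left out here because the text does not give them.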

The outputs from the stope design exercise are summarised in Fig. 18.6 for 5, 10 and 20 m drill spacing. Orebodies 1 and 2 H/W (hanging wall) are mined in a single stope and orebodies 2 F/W and 3 are mined in separate stopes. Grades were estimated by kriging and are in metal equivalents of lead (weighted sum of lead, zinc and silver grades); intervals are $\pm 2\sigma_K$ where $\sigma_K$ is the square root of the kriging variance and is used as an index of uncertainty rather than a confidence interval. Taking the 5 m designs as actual boundaries, the stope designs based on 10 and 20 m drilling show the effects of decreasing amounts of data on planned tonnage and average grade.


**Fig. 18.6** Stope designs with contained tonnages and grades for 5, 10 and 20 m drill spacing for orebodies 1 and 2 HW (left); 2 FW (centre) and 3 (right)


**Table 18.1** Differences in tonnes and grades of stopes compared with 5 m designs

The stope designs are based on the data and interpretations from the respective drilling densities but the grades and tonnages are estimated using all data (5 m drill spacing). Assuming the data from the 5 m drill spacing gives the closest possible quantification of reality on all practical scales then the grade and tonnage of the 10 and 20 m stope designs estimated from all data can be regarded as sufficiently close to the real tonnage and grade that could be recovered from the designs.

The effects of data density on grades and tonnages are summarised in Table 18.1. As an example, using the 20 m drill spacing data to design stope 2 (the high-grade orebody 2 footwall) would increase tonnage by 21.4% and reduce grade by 9.6%. There would be an increase in metal tonnage of 9.7% but this would be at the cost of mining, hauling and processing the additional ore tonnage.
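The metal-tonnage figure follows from the tonnage and grade changes, because contained metal is the product of tonnage and grade. A short check of that arithmetic (the function name is illustrative only):

```python
def metal_change(tonnage_change: float, grade_change: float) -> float:
    """Relative change in contained metal from relative changes in tonnage and grade."""
    return (1.0 + tonnage_change) * (1.0 + grade_change) - 1.0

# Stope 2 designed from the 20 m drill spacing, using the percentages quoted in the text
change = metal_change(0.214, -0.096)
print(f"{100 * change:.1f}%")  # 9.7%
```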

Whilst the effects of data on a specific type of mining are of interest, the more important issue is the effect of the assumed geological model on the choice of mining method. The initial geological model was influenced by the knowledge accumulated over a long period of mining in the neighbouring Mt Isa orebodies. The detailed analysis described here enabled the effects of the greater complexity and lower continuity of the Hilton orebodies to be systematically quantified, thereby significantly reducing the impact of epistemic uncertainty and contributing to the selection of the most appropriate mining method and mine design.

### **18.5 Quantifying Epistemic Uncertainty**

In the Hilton example, geological model uncertainty was addressed at the significant cost of more samples—effectively eliminating the epistemic uncertainty on the operational scale through more data and analysis. With the hindsight of the additional data and analysis, and on the assumption that the test volume is sufficiently representative of the remainder of the orebodies, the epistemic uncertainty associated with various drilling grids could be quantified. This would allow assessment of the value of additional information against the cost of collecting it and/or the operational cost of not collecting it. Stekenjokk is an example of the practical consequences of proceeding with an unacceptable level of epistemic uncertainty.

There is an extensive literature on using Bayesian probability to quantify epistemic uncertainty particularly to combine sources of uncertainty (e.g., Winkler 1981; Sankararaman and Mahadevan 2011) and to incorporate expert knowledge and informed guesses in the form of subjective probabilities. It can be argued that subjective probabilities are used implicitly throughout geostatistical analysis, modelling, estimation and simulation irrespective of the amount of data. Expert knowledge/judgment guides variogram calculation and interpretation, choice of training images, domaining, sample differentiation, choice of estimation or simulation method and validity of outputs. There is, however, a distinction between the explicit subjective probability of informed guesses and possible geological models and the implicit subjectivity in inferring model parameters from quantitative data.

In the remainder of this chapter, a distinction is made between model uncertainty and uncertainty of the parameters of a specific model. Many authors do this although in some cases the former may be a case of the latter e.g., it might be argued (with some difficulty) that Stekenjokk was a matter of incorrect structural parameters (degree of folding). A more convincing argument could be made for the Hilton case—the initial assumed model was a Mt Isa type stratiform orebody and the final agreed version was a more complex and less continuous version of the latter.

In addition to Bayesian approaches, others include evidence theory (Dempster 1968; Shafer 1976), fuzzy sets (Zadeh 1965) and possibility theory (Zadeh 1978; Dubois and Prade 2001). These and other approaches are extensively used to quantify uncertainty in risk analysis and a good coverage of probabilistic risk analysis is given in Bedford and Cooke (2001).

Over the past 30 years, all these approaches have been used to incorporate model uncertainty in geostatistical estimation and simulation and the following list is intended as representative rather than exhaustive. Omre (1987) used Bayesian kriging to include qualified guesses when few data are available; the weight assigned to the guess increases as the amount of data decreases.

Fuzzy kriging has been proposed as a means of including aleatory uncertainty (in the sense of inaccurate or imprecise measurements) and epistemic uncertainty (imprecise variogram parameters) in estimation. Uncertain data will, of course, lead to an uncertain variogram but certain (accurate, error-free) data will not necessarily lead to a certain variogram. Diamond (1989) proposed fuzzy kriging to deal with uncertain or imprecise data. Bardossy et al. (1988, 1990a, b) proposed fuzzy kriging for dealing with both sources of uncertainty but the computational cost hindered its use. More recently, Loquin and Dubois (2010a, b) have developed these approaches in computationally feasible forms. Bandemer and Gebhardt (2000) combine fuzzy kriging with Bayesian incorporation of prior knowledge. Bardossy and Fodor (2004) provide a comprehensive coverage of the use of fuzzy set theory to quantify geological uncertainty and consequent risk.

Srivastava (2005) used probabilistic modelling of ore lenses to account for uncertainty in the boundaries of geological domains that constrain grade occurrence. Dowd (1986, 1994) and Dowd et al. (1989) used deterministic and probabilistic methods for the same purpose in estimating and simulating grades.

Verly et al. (2008) quantified geological model uncertainty in a porphyry copper deposit by simulating the four principal characteristics of porphyry models: faults defining fault blocks; faulted rock types within fault blocks; un-faulted intrusive and breccia bodies; and alteration and copper grade shells.

Maximum likelihood estimation of spatial model parameters has been widely reported in geostatistical applications: Mardia and Marshall (1984), Kitanidis and Lane (1985), Zimmerman (1989), Dietrich and Osborne (1991) among others. Pardo-Igúzquiza and Dowd (1997a, b, c, 2003, 2013), Dowd and Pardo-Igúzquiza (2002) and Pardo-Igúzquiza et al. (2013) used maximum likelihood estimates of variogram parameters and associated uncertainties to incorporate the effects of model uncertainty in simulation and estimation.

For categorical variables, such as geological shapes and surfaces, multiple point statistics simulation provides a means of specifying possible geological scenarios in the form of alternative training images. Caers (2011) uses different training images to introduce geological model uncertainty into the simulation of oil reservoirs. Park et al. (2013) use history matching to quantify the uncertainty of facies models in the form of alternative training images. Hermans et al. (2014) choose among several geological scenarios in the form of possible training images using geophysical data and Bayes rule to compute the conditional probabilities of the alternative training images given the geophysical data.
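The Bayes-rule step described above reduces to a discrete update when the scenarios are a finite set of training images. A minimal sketch, in which the priors and the likelihoods of the geophysical data under each training image are made-up numbers (how such likelihoods are obtained in the cited work is not specified here):

```python
import numpy as np

def posterior_scenario_probs(prior, likelihood):
    """Bayes' rule over a discrete set of scenarios (e.g. alternative training images)."""
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    joint = prior * likelihood      # P(TI_k) * P(data | TI_k)
    return joint / joint.sum()      # normalise to obtain P(TI_k | data)

# Hypothetical values for three candidate training images
prior = [0.5, 0.3, 0.2]
likelihood = [0.10, 0.30, 0.60]
post = posterior_scenario_probs(prior, likelihood)
print(post)
```

In this illustration the third training image has the lowest prior but the highest posterior, because the data are most probable under it.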

With a few notable exceptions, in most mining applications the geological (model) uncertainty from the feasibility stage onwards can be limited to uncertainty in model parameters rather than uncertainty about the general model (e.g., stratiform, vein, disseminated). However, for cases where fundamental (and a priori, unverifiable) assumptions are/must be made about the general model, as in oil and gas applications or applications in which physical processes give rise to the variables (e.g., HDR fracture occurrence and propagation), it is essential to test the sensitivity of these assumptions by reconciling the consistency of outputs (e.g., heat production from a geothermal reservoir) with predicted responses to inputs (e.g., fluid flow through fracture networks). The fundamental difference between these cases and mining applications is that ultimately the latter can be directly observed.

On the assumption that the most important characteristics of the underlying model can be captured in several parameters of a broad model, the uncertainty in the parameter estimates can be quantified by generating a set of parameter values using an appropriate set of rules; simulating the spatial random variable(s) using these parameter values; and repeating this process a sufficiently large number of times. Methods for sampling parameter values include Maximum Likelihood, Bootstrap methods (Olea et al. 2015), Bayesian analysis (Kitanidis 1986) and, in multiple point statistics simulations, Bayesian selection of alternative templates or training images (Park et al. 2013; Hermans et al. 2014) and clustering combined with system responses (Caers 2011).
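The generate, simulate and repeat loop described above can be sketched as follows. This is an illustration under stated assumptions rather than the method of any of the cited papers: the covariance model is exponential, the sill and range are drawn from independent normal distributions centred on hypothetical estimates, and the simulation is unconditional (Cholesky factorisation) to keep the example short.

```python
import numpy as np

def exp_cov(coords, sill, range_par):
    """Exponential covariance matrix C(h) = sill * exp(-h / range_par)."""
    h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return sill * np.exp(-h / range_par)

def simulate_with_parameter_uncertainty(coords, est, se, n_real, seed=0):
    """Draw (sill, range) for each realisation, then simulate a Gaussian field."""
    rng = np.random.default_rng(seed)
    n = len(coords)
    fields = np.empty((n_real, n))
    for i in range(n_real):
        # Assumption: independent normal sampling distributions, floored to stay positive
        sill = max(rng.normal(est["sill"], se["sill"]), 1e-6)
        range_par = max(rng.normal(est["range"], se["range"]), 1e-6)
        cov = exp_cov(coords, sill, range_par) + 1e-10 * np.eye(n)  # jitter for stability
        fields[i] = np.linalg.cholesky(cov) @ rng.standard_normal(n)
    return fields

coords = np.random.default_rng(42).uniform(0, 10, size=(30, 2))
est = {"sill": 1.0, "range": 3.0}   # hypothetical parameter estimates
se = {"sill": 0.2, "range": 0.8}    # hypothetical standard errors
fields = simulate_with_parameter_uncertainty(coords, est, se, n_real=200)
print(fields.shape)  # (200, 30)
```

Setting both standard errors to zero recovers the usual case in which the model is treated as known, so comparing the two sets of realisations isolates the contribution of parameter uncertainty.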

The following two examples illustrate the use of maximum likelihood in model selection and parameter inference and the propagation of the associated uncertainties into geostatistical simulation for environmental and mining applications.

### *18.5.1 Example: Transmissivity Uncertainty*

This example is taken from Dowd and Pardo-Igúzquiza (2002). The data are from Gotway (1994) and comprise 41 transmissivity measurements in the Culebra Dolomite formation in New Mexico. The original application was for nuclear waste site assessment, where uncertainty in the groundwater travel time of a particle is assessed through its probability density function, which is estimated by running groundwater flow and transport programs with different transmissivity field inputs. These inputs are generated by conditional simulations of transmissivity.

The data are the logarithms of transmissivity in m<sup>2</sup> s<sup>−1</sup> and the data locations are shown in Fig. 18.7 together with a histogram of the log-transmissivity data.

Maximum likelihood was used to estimate the parameters of an exponential covariance model of the residuals for drift orders 0, 1 and 2. Although drift is a deterministic component of the universal model, in practice the coefficients are estimated from the available data and are thus random variables. Their means and standard errors are given in Table 18.2 for the optimal drift model (as determined by the Akaike information criterion), which is of order 1: drift(*x*, *y*) = *β*<sub>0</sub> + *β*<sub>1</sub>*x* + *β*<sub>2</sub>*y*. The estimated covariance parameters for *k* = 1 are given in Table 18.3 and the variogram is shown in Fig. 18.8.

In this case, as there is no nugget variance, the range and sill are estimated independently. The correlation between range and sill is thus zero and any combination of values of the two parameters inside their respective intervals is inside the 95% confidence region as shown in Fig. 18.9a. The drift coefficients are also independent of the sill and the range. As the estimated drift coefficients are correlated, not every combination of the three parameter values is equally reliable, i.e. values inside the 95% confidence interval of the parameters taken together may not be inside the 95% confidence interval for each individual parameter. The confidence interval is an ellipsoid. Figure 18.9b shows the 95% confidence region for (*β*<sub>1</sub>, *β*<sub>2</sub>) when the third coefficient of the model is set to the estimated value given in Table 18.3.
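The difference between a joint confidence region and individual confidence intervals can be illustrated with a short sketch. It assumes that the estimates of two correlated coefficients are approximately normal, so that the joint 95% region is an ellipse bounded by the chi-square quantile with two degrees of freedom. The estimates and covariance matrix below are hypothetical and are not the values in Table 18.2.

```python
import math
import numpy as np

def inside_joint_region(theta, theta_hat, cov, level=0.95):
    """True if theta lies inside the joint confidence ellipse of two
    parameters, assuming approximately normal estimates."""
    diff = np.asarray(theta, float) - np.asarray(theta_hat, float)
    d2 = diff @ np.linalg.solve(cov, diff)         # squared Mahalanobis distance
    chi2_quantile = -2.0 * math.log(1.0 - level)   # exact for 2 degrees of freedom
    return bool(d2 <= chi2_quantile)

def inside_marginal_intervals(theta, theta_hat, cov, z=1.96):
    """True if each parameter lies inside its own 95% interval."""
    se = np.sqrt(np.diag(cov))
    diff = np.abs(np.asarray(theta, float) - np.asarray(theta_hat, float))
    return bool(np.all(diff <= z * se))

# Hypothetical estimates and covariance (correlation 0.8).
beta_hat = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])

corner = [1.9, -1.9]   # inside both marginal intervals, against the correlation
along = [2.2, 2.2]     # outside both marginal intervals, along the correlation
print(inside_marginal_intervals(corner, beta_hat, cov), inside_joint_region(corner, beta_hat, cov))
print(inside_marginal_intervals(along, beta_hat, cov), inside_joint_region(along, beta_hat, cov))
```

With correlated estimates the two criteria disagree in both directions, which is why parameter combinations should be drawn from the joint region and not from the box formed by the individual intervals.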

**Fig. 18.7** Data locations (distances in km) and histogram of log transmissivity data



The effects of model uncertainty on simulation outputs are illustrated by generating six simulations for each pair of values A, B, C, D and E in Fig. 18.9; each set of simulations was started with the same random number seed. The simulations are shown in Fig. 18.10. The differences between corresponding simulations (e.g., first simulation in each of A, B, C, D and E) for the five sets of parameters reflect the model uncertainty, which could be quantified further by simulating flow and transport through the simulated transmissivity realisations.

*[Plot not reproduced; horizontal axis: Distance (lag)]*

### *18.5.2 Example: Coal Resource Risk Assessment*

One of the most significant contributors to the total risk in the evaluation of coal-mining projects is the uncertainty of the resource tonnage and quality characteristics, often called the resource risk. This example is from the As Pontes deposit in Galicia, Spain (Pardo-Igúzquiza et al. 2013). The most significant variable in the assessment of resource uncertainty is the thickness of the coal seam. Figure 18.11 shows the data locations at which seam thickness is measured together with the estimated variogram values and the manually fitted (isotropic) variogram model.

**Fig. 18.9 a** (left) 95% confidence region for sill and range; **b** (right) confidence region for drift parameters *β*<sub>1</sub> and *β*<sub>2</sub> with *β*<sub>0</sub> = −1.6062

**Fig. 18.10** Outputs from six simulations using the variance and range parameters denoted by the mean values A and the extreme values B, C, D and E in Fig. 18.9

**Fig. 18.11** (Left) drill-hole locations and boundary of the study area. (Right) Variogram and manually fitted model for seam thickness

Spherical model variograms for seam thickness:


Although the maximum likelihood estimates of the parameters are very similar to those estimated by visual fitting, maximum likelihood has the advantage of providing estimates of the uncertainty of the parameters. For illustrative purposes, resources were computed as tonnage from panels with thickness above a threshold defined by the 25th percentile of the sample data and equal to a thickness of 8.65 m. The kriged resource volume is 1.97 × 10<sup>8</sup> m<sup>3</sup> .

Sequential Gaussian simulation was used to generate realisations of the thickness of the seam. To quantify the uncertainty in the estimated resource, a total of 870 simulations were generated using the 'certain' variogram (maximum likelihood parameters) and the total resource was calculated for each simulation. The histogram of the 870 simulated resources quantifies the uncertainty of the estimated resources. An example simulation is shown in Fig. 18.12.

The parameter space {*r*<sub>0</sub>, *a*, *σ*<sup>2</sup>}, comprising respectively the nugget/variance ratio, range and variance, is used to quantify the uncertainty in the model. The parameter values were divided into discrete steps of 0.05 for *r*<sub>0</sub> in the interval [0, 1], 700 m for *a* in the interval [1,000, 15,000] and 0.1 for *σ*<sup>2</sup> in the interval [0.6, 2.6]. There are 268 triplets (*r*<sub>0</sub>, *a*, *σ*<sup>2</sup>) that lie inside the 75% confidence region. As these models are not equally probable, the probabilities are normalised so that they sum to 1.0 and each model is included as many times as indicated by its normalised probability (i.e., probability sampling in which, for example, a model with a normalised probability of 0.35 comprises 35% of the total simulated triplets). A total of 870 simulations were used.
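A minimal sketch of the discretisation and of the probability sampling is given here. The grid limits and steps are those stated in the text; the relative likelihoods of the four example models and the largest-remainder rounding rule are assumptions introduced for illustration, as the text does not specify how fractional counts are handled.

```python
import numpy as np

# Discretised parameter space as described in the text.
r0 = np.linspace(0.0, 1.0, 21)              # nugget/variance ratio, step 0.05
a = np.linspace(1_000.0, 15_000.0, 21)      # range (m), step 700
sigma2 = np.linspace(0.6, 2.6, 21)          # variance, step 0.1
n_candidates = r0.size * a.size * sigma2.size   # candidate triplets before screening

def allocate_simulations(weights, n_sims):
    """Split n_sims among models in proportion to their normalised weights,
    using the largest-remainder rule so that the counts sum to n_sims."""
    p = np.asarray(weights, float)
    p = p / p.sum()                          # normalise so the probabilities sum to 1
    exact = p * n_sims
    counts = np.floor(exact).astype(int)
    shortfall = n_sims - counts.sum()
    # Give the remaining simulations to the largest fractional parts.
    order = np.argsort(exact - counts)[::-1]
    counts[order[:shortfall]] += 1
    return counts

# Hypothetical relative likelihoods for four models inside the confidence region.
counts = allocate_simulations([7.0, 5.0, 5.0, 3.0], n_sims=870)
print(n_candidates, counts, counts.sum())
```

In the application described here, the weights would be the likelihoods of the 268 triplets inside the 75% confidence region.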

Histograms of the total resources for the 870 simulations, with and without the uncertainty of the variogram model parameters, are given in Fig. 18.13. There is no significant difference in mean resource values for the certain and uncertain values.

The 95% confidence interval for the total resource is [1.88 × 10<sup>8</sup>, 2.19 × 10<sup>8</sup>] m<sup>3</sup> when the variogram is assumed to be known with certainty and [1.90 × 10<sup>8</sup>, 2.23 × 10<sup>8</sup>] m<sup>3</sup> when the uncertainty of the variogram model is included. The latter interval is slightly higher and slightly wider than the former. However, the probability that the total resource will be greater than 2.0 × 10<sup>8</sup> m<sup>3</sup> is 0.59 when the uncertainty of the variogram parameters is ignored and 0.75 when the uncertainty of the variogram parameters is propagated into the simulated realisations. In other words, whilst there is no significant difference in the mean resource for the two sets of simulations, the difference in the two distributions (because of different variances) is sufficient to generate significantly different resource estimates above selected cut-offs.
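Both summaries quoted above, the interval and the probability of exceeding a threshold, are simple empirical statistics of the set of simulated totals. The sketch below shows one way of computing them; the function name is hypothetical and the 870 values are random stand-ins, not the simulated resources of the case study.

```python
import numpy as np

def summarise_resources(resources, threshold, level=0.95):
    """Empirical central interval and exceedance probability from simulated totals."""
    resources = np.asarray(resources, float)
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(resources, [tail, 100.0 - tail])
    p_exceed = float(np.mean(resources > threshold))   # fraction of realisations above threshold
    return float(lower), float(upper), p_exceed

# Hypothetical stand-ins for 870 simulated totals (m^3).
rng = np.random.default_rng(0)
sims = rng.normal(2.03e8, 0.08e8, size=870)
lower, upper, p_exceed = summarise_resources(sims, threshold=2.0e8)
print(f"{lower:.3e} {upper:.3e} {p_exceed:.2f}")
```

Applying the same function to the two sets of simulations, with and without variogram uncertainty, gives the comparison discussed in the text.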

**Fig. 18.13** Histograms of total resources calculated by geostatistical simulation assuming the variogram model parameters are known with certainty (solid line) and including the uncertainty of the semi-variogram model parameters (dashed line)

In this case, the differences in the total volume of resources, with and without quantification of semi-variogram uncertainty, are small but the consequence of selecting from the distribution of possible resources is significant. This illustrates a general principle: the estimated total resource and the mean simulated resource, with and without semi-variogram uncertainty, may not differ significantly but the distributions of the two simulations will differ because of the different variances. Similarly, selecting panel values above a threshold from the set of estimated panel thicknesses or from a set of simulated panel thicknesses will yield different results.

In general, the outcome from the simulations with and without semi-variogram uncertainty depends on the deposit and the amount of data available. Evaluation of model uncertainty is critical in resource risk assessment even if it is ultimately found that there is no practical difference between resource estimates obtained by ignoring or including semi-variogram uncertainty. This example also has important implications for compliance with resource and reserve reporting codes, most of which use terms such as, or equivalent to, *the amount of error* [associated with an estimate], *the level of accuracy* [of an estimate], the *level of confidence* [in a reserve statement], and *levels of geological confidence* (words in italics are quoted from JORC 2012). Whilst all reporting codes currently use these terms qualitatively they all have specific quantitative meanings in statistics, probability and risk assessment and are increasingly being referred to explicitly in reporting codes.

### **18.6 Quantifying the Effects of Transfer Uncertainty**

An example of passive transfer uncertainty is the variation in open-pit size and shape as a function of grade uncertainty as shown in Fig. 18.14 taken from a study of a small gold orebody (Dowd 1995, 1997). The impacts of these types of uncertainty can be quantified by standard applications of geostatistical simulation. Dimitrakopoulos and co-workers have made significant contributions to the integration of in-situ grade and geological uncertainty into optimization algorithms (e.g., Dimitrakopoulos et al. 2002; Goodfellow and Dimitrakopoulos 2013).

More challenging is the impact of propagating in-situ uncertainty through the mining (extraction) process. The critical component of most metalliferous open-pit mining operations is ore selection, i.e. the minimisation of ore loss and ore dilution during extraction. In general, extraction comprises drilling, blasting and loading, all of which are planned and designed on uncertain models of local geology and grade. The conversion of the in-situ block model resource to a realistically recoverable reserve may, in many instances, be the most significant source of uncertainty in reserve estimation. The usual assessment of recoverable reserves, for example, is limited to a simple volumetric exercise in which ore recovery is assessed as a function of applying a range of selection volumes to a simulated orebody or an even simpler volume-based adjustment of the variance of estimated block values. These simplistic approaches ignore the practicalities of the mining, selection and loading processes: blast design, behaviour and performance; equipment type, size and operation; ore displacement during blasting and loading; and ability to identify ore zones within a blast muck pile. In many applications, the uncertainties introduced by these technical processes are at least as significant as those that derive from the in-situ spatial characteristics of grades and geology.

**Fig. 18.14** Optimal open pits generated from 100 simulations of a small gold orebody. Top: maximum volume; centre: median volume; bottom: minimum volume

An approach to quantifying transfer process uncertainty for blasting and loading comprises:


The in-situ model, representing perfect knowledge at all relevant scales, is obtained by geostatistical simulation. An in-situ model that represents the reality of knowing only the data and information that are available from specific grade control drilling and sampling grids can be obtained by sampling the geostatistically simulated model on a specified grid. The volumes comprising the in-situ model are then populated by estimates based only on the data corresponding to the specified grade-control drilling and sampling grids. Different drilling and sampling grids can be used to generate different models, each reflecting the levels of data and information available. Selectivity can then be assessed as a function of the drilling and sampling grids as well as the size and type of loader. Performance is assessed against the ideal selectivity that can be achieved on the perfect knowledge model, comprising the simulated values of each component volume. Applying costs, prices and financial criteria enables an optimal selection of the grade control drilling grid, size of loader, type of loading and even blast design.

The following case study (Dowd and Dare-Bryan 2004) is based on the Minas de Rio Tinto SAL open-pit copper mine at Rio Tinto, southern Spain, which is typical of a low-grade operation in the later stages of its life. Ore/waste delineation for selective mining is difficult because the head grades are near the economic cut-off grade and there are no clear geological controls on the mineralisation.

Sequential Gaussian simulation, with the blast-hole grades as conditioning data, was used to generate realisations of each mining bench on a block grid of 0.5 m × 0.5 m × 0.5 m, a grid size determined by blast and selection criteria.

**Fig. 18.15 a** simulated copper grades in a bench: three horizontal sections; **b** four vertical sections; **c** blast profile resulting from simulated blast applied to simulated grades; **d** predicted composition of blast profile from simulated blast applied to in-situ grades estimated from samples taken from blast-holes on 8 m spacing

**Fig. 18.16** (Left) selected ore volumes based on estimates (Right) actual ore volumes

The first aspect of predicting recovery is the in-situ heterogeneity of the ore and the extent to which it forms contiguous 'parcels' of a size relative to the selection size (capacity and size of loading equipment). The second aspect is the heterogeneity of the ore after it has been subjected to blasting (i.e., the in-situ geological spatial variability and the post-transfer in-situ blast-pile spatial variability).

Figure 18.15 shows horizontal and vertical cross-sections through a simulated bench of dimensions 80 m × 40 m × 12 m (height). The simulated copper grades are shown on horizontal planes at the top and bottom of the 12 m bench and on a mid-plane at 6 m. The vertical cross-sections are at the extremities of the bench (0 and 80 m) and on intermediate planes at 28 m intervals.

Figure 18.16 shows the assumed contiguous parcels of ore in the blast pile based on estimated in-situ grade values together with the actual (simulated) parcels of ore. A comparison of the two sets of ore volumes in Fig. 18.16 would quantify ore loss and ore dilution. Blast movement sensors, inserted in drill holes and detected in the blast-pile, are widely used to identify post-blast ore parcels. In such cases, this process would quantify the uncertainty associated with the initial placement of sensors based on estimated in-situ ore locations and a grade continuity model.
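The comparison of selected and actual ore volumes can be sketched as a block-by-block classification against a cut-off grade. The function below is a simplified illustration under stated assumptions: blocks are of equal volume, ore loss is expressed as a fraction of the true ore and dilution as a fraction of the selected material (other definitions are in use), and the grades and cut-off are hypothetical and are not data from the case study.

```python
import numpy as np

def ore_loss_and_dilution(true_grade, estimated_grade, cutoff):
    """Compare selection based on estimates with the 'perfect knowledge' model.
    Returns (ore loss, dilution) as fractions of true ore and of selected material."""
    true_ore = np.asarray(true_grade, float) >= cutoff
    selected = np.asarray(estimated_grade, float) >= cutoff
    lost = np.sum(true_ore & ~selected)        # ore blocks sent to waste
    diluting = np.sum(~true_ore & selected)    # waste blocks sent to the mill
    ore_loss = lost / max(np.sum(true_ore), 1)
    dilution = diluting / max(np.sum(selected), 1)
    return float(ore_loss), float(dilution)

# Hypothetical block grades (% Cu) and cut-off.
true_grade = np.array([0.10, 0.25, 0.40, 0.35, 0.15, 0.50])
estimated = np.array([0.22, 0.18, 0.38, 0.30, 0.12, 0.45])
print(ore_loss_and_dilution(true_grade, estimated, cutoff=0.20))
```

Repeating the calculation for estimates based on different grade-control drilling grids would give ore loss and dilution as functions of the sampling grid.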

Among other examples, Goodfellow and Dimitrakopoulos (2017) describe an approach that integrates sources of uncertainties arising from the combined production of several mines. The in-situ orebody uncertainties are integrated with process uncertainties from extraction to processing to marketing as the basis of modelling and stochastically optimising the value chain of a mining complex.

### **18.7 Conclusion**

There is a growing requirement for integrated frameworks for uncertainty quantification in all geologically based applications. Quantified uncertainty and geostatistical methods are increasingly being referenced explicitly in mineral resource and reserve codes. This does not require rewriting the reporting codes but it does mean that there is a need to establish a generally accepted framework for the quantification of all sources of uncertainty.

Quantified risk assessments for environmental applications are now required in many jurisdictions for applications such as waste burial and the treatment, storage and disposal of radioactive material. These assessments are required to cover time periods that range from around 200 years for household wastes to thousands of years for the underground storage or disposal of radioactive wastes.

The management of groundwater resources, especially karst systems in environmentally vulnerable coastal areas, requires the integration of flow, extraction, seawater intrusion, contamination from agriculture and other activities.

In these and all such applications the identification and quantification of all sources of uncertainty is critical to ensuring reliable estimation, planning, design and, for resource extraction, production and to managing associated risks. As summarised here, many methods and approaches have been developed by many authors but most are limited to aleatory uncertainty.

The work summarised here provides examples of methods that have been successfully applied to identify and quantify all sources of uncertainty in mineral resource and environmental applications. They provide a contribution to the need, and the increasing requirement, to develop integrated frameworks for uncertainty quantification in all geologically based applications.

**Acknowledgements** I am grateful to my co-authors of our cited joint publications and particularly to Eulogio Pardo-Igúzquiza with whom I have collaborated for over 20 years.

### **References**

Bandemer H, Gebhart A (2000) Bayesian fuzzy kriging. Fuzzy Sets Syst 112:405–418


Diamond P (1989) Fuzzy kriging. Fuzzy Sets Syst 33:315–332



### **Chapter 19 Advances in Sensitivity Analysis of Uncertainty to Changes in Sampling Density When Modeling Spatially Correlated Attributes**

**Ricardo A. Olea**

**Abstract** A comparative analysis of distance methods, kriging and stochastic simulation is conducted for evaluating their capabilities for predicting fluctuations in uncertainty due to changes in spatially correlated samples. It is concluded that distance methods lack the most basic capabilities to assess reliability despite their wide acceptance. In contrast, kriging and stochastic simulation offer significant improvements by considering probabilistic formulations that provide a basis on which uncertainty can be estimated in a way consistent with practices widely accepted in risk analysis. Additionally, using real thickness data of a coal bed, it is confirmed once more that stochastic simulation outperforms kriging.

### **19.1 Introduction**

In any form of sampling, there is always significant interest in establishing the reliability that may be placed on any conclusions extracted from a sample of certain size. In the earth sciences and engineering, such conclusions can be the extension of a contamination plume or the in situ resources of a mineral commodity. Increases in sample size result in monotonic improvements with diminishing returns: up to measuring the entire population, the benefits increase with the number of observations. In the classical statistics of independent random variables, the number of observations is all that counts. In spatial statistics, however, the locations of the data are also important.

Early on in spatial sampling, it was recognized that sampling distance was a factor in determining the reliability of estimations. However, insurmountable difficulties of incorporating other factors led to the reliability of spatial samplings being determined solely by geographical distance, particularly for the public disclosure of mineral resources (e.g., USBM and USGS 1976).

R. A. Olea (✉)

U.S. Geological Survey, 12201 Sunrise Valley Drive, Mail Stop 956, Reston, VA 20192, USA e-mail: rolea@usgs.gov

<sup>©</sup> The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_19

Significant advances in the determination of spatial uncertainty did not take place until the advent of digital computers and the formulation of geostatistics (e.g., Matheron 1965). Geostatistics introduced the concept of kriging variance, which was a significant improvement over the relatively simplistic distance criteria for determining reliability. The third generation of methods to determine reliability of spatial sampling came with the development of spatial stochastic simulation shortly after the formulation of kriging (Journel 1974).

Although there are several reports in the literature about applications of distance methods (e.g., USGS 1980; Wood et al. 1983; Rendu 2006) and kriging (e.g., Olea 1984; Bhat et al. 2015), the mere fact that distance methods are still being used indicates that the merits of the geostatistical methods remain unappreciated. This chapter is an application of the three families of methods for conducting sensitivity analyses on the reliability of the assessment of geologic resources due to variations in sample spacing. The simulation formulation given here is novel as it is an illustrative example used for comparing all three approaches.

### **19.2 Data**

The data in Fig. 19.1 and Table 19.1 of the Appendix will be used to anchor the presentation. They are thickness measurements for the Anderson coal bed in a central part of the Gillette coal field of Wyoming taken from a more extensive study (Olea and Luppens 2014). A conversion factor could have been used to transform all the thickness values to tonnage, but it was decided to perform the analysis in terms of the attribute actually measured. The reader may want to know, however, that a density of 1,770 short tons per acre-foot for subbituminous coal is a good average value to estimate tonnage values and that the cell size used here is 400 ft by 400 ft.

**Fig. 19.1** Measurements of thickness for the Anderson coal bed in a central part of the Gillette coal field, Wyoming, USA: **a** posting of values; **b** histogram
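The conversion from thickness to tonnage mentioned above is a product of thickness, cell area in acres and density. A minimal sketch follows; the density and cell size are those quoted in the text, whereas the function name and the 45 ft input are illustrative assumptions and do not refer to a particular cell of Table 19.1.

```python
SQFT_PER_ACRE = 43_560.0            # definition of the acre
DENSITY_ST_PER_ACRE_FT = 1_770.0    # subbituminous coal, value quoted in the text
CELL_SIDE_FT = 400.0                # cell size quoted in the text

def cell_tonnage(thickness_ft, side_ft=CELL_SIDE_FT, density=DENSITY_ST_PER_ACRE_FT):
    """Short tons of coal in one square cell of the given thickness."""
    area_acres = side_ft * side_ft / SQFT_PER_ACRE
    return thickness_ft * area_acres * density

# Illustrative thickness of 45 ft (an assumed input).
print(round(cell_tonnage(45.0)))
```

Because the conversion is linear, any uncertainty expressed in feet of thickness can be rescaled to short tons with the same factor.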

With resources of more than 200 billion short tons of coal in place, the Gillette coal field is one of the largest coal deposits in the United States (Luppens et al. 2008). There are eleven beds of importance in the field. The Anderson coal bed, in the Paleocene Tongue River Member of the Fort Union Formation, is the thickest and most laterally continuous of the six most economically significant beds. This low sulfur, subbituminous coal has a field average thickness of 45 ft. Hence, it is the main mining target.

### **19.3 Traditional Uncertainty Assessment**

For a long time, the prevailing practice has been the determination of uncertainty in mining assessments based on distance between drill holes. Figure 19.2 shows an example following U.S. Geological Survey Circular 891 (Wood et al. 1983), hereafter referred to as Circular 891. This example uses the drill holes in Fig. 19.1a after eliminating the holes along the diagonal. Circular 891 classifies resources into four categories according to the distance from the estimation location to the closest drill hole:


**Fig. 19.2** Classification of in situ resources according to Circular 891 for the data in Fig. 19.1a after eliminating the drill holes along the diagonal
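A distance classification of this kind reduces to a lookup on the distance to the nearest drill hole. The sketch below assumes the thresholds commonly associated with Circular 891 (1/4, 3/4 and 3 miles, with the category names measured, indicated, inferred and hypothetical); these values, the treatment of the boundaries and the function names are assumptions to be checked against the circular itself.

```python
import math

# Assumed upper distance limits (miles) for the first three categories.
CATEGORIES = [(0.25, "measured"), (0.75, "indicated"), (3.0, "inferred")]

def classify(distance_mi):
    """Reliability category for a given distance to the closest drill hole."""
    for limit, name in CATEGORIES:
        if distance_mi <= limit:
            return name
    return "hypothetical"

def classify_location(x_mi, y_mi, drill_holes):
    """Classify an estimation location from drill-hole coordinates in miles."""
    nearest = min(math.hypot(x_mi - hx, y_mi - hy) for hx, hy in drill_holes)
    return classify(nearest)

# Hypothetical drill holes and estimation location.
holes = [(0.0, 0.5), (4.0, 4.0)]
print(classify_location(0.0, 0.0, holes))
```

Only the distance to the single closest hole enters the result; the number and arrangement of the other holes and the variability of the attribute play no part.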

Classification schemes like this are fairly simple and gained popularity prior to the advent of computers. Evaluating the degree of uncertainty of a magnitude or an event is the domain of statistics (e.g., Caers 2011). The standard approach for analyzing uncertainty consists of listing all possible values or events and then assigning a relative frequency of occurrence. A simple example is the tossing of a coin, where the outcomes are head and tail. For a fair coin, these two events occur with the same frequency, which is called probability when normalized to vary from 0 to 1. The same concept can be applied to any event or attribute, including coal bed thickness. For example, the outcome at a site not yet drilled could be modeled as the following random variable:


Note that the sum of the probabilities of all possible outcomes is 1.0. Random variables rigorously allow answering multiple questions about unknown magnitudes, in this case, the likely thickness to penetrate. A sample of just three assertions would be: (a) coal will certainly be intersected because the value zero is not listed among the possibilities; (b) it is more likely that the intersected thickness will be less than 15 ft than greater than 15 ft; and (c) odds are 6 to 4 that the thickness will be between 10 and 21 ft, or to put it differently, the 11 ft interval between 10 and 21 ft has a probability of 0.6 of containing the true thickness. These are the standard concepts and tools used universally in statistics to characterize uncertainty.

The classification system established by Circular 891 does not use probabilities and lacks the predictive power of a random variable approach. In particular,


resources considering all coal beds, while logic indicates that the extension of true reliability classes should be all different.


Despite these drawbacks and the formulation of the superior alternatives below, Circular 891 and similar approaches remain the prevailing methods worldwide for the public disclosure of uncertainty in the assessment of mineral resources and reserves (JORC 2012; CRIRSCO 2013).

### **19.4 Kriging**

Kriging is a family of spatial statistics methods formulated for the improvement in the reporting of uncertainty and in the estimation of the attributes of interest themselves. Although it is possible to establish links between kriging and other older estimation methods in various disciplines, mining was the driving force behind the initial developments of kriging and other related methods collectively known today as geostatistics (Cressie 1990).

Kriging is basically a generalization of minimum mean square error estimation taking into account spatial correlation. Kriging provides two numbers per location **s**<sub>*o*</sub>, conditioned to some sample of the attribute, *z*(**s**<sub>*i*</sub>), *i* = 1, 2, ..., *N*: an estimate of the unknown value, *z*\*(**s**<sub>*o*</sub>), and a standard error, *σ*(**s**<sub>*o*</sub>). The exact expression for these results depends on the form of kriging. For ordinary kriging, the most commonly applied form and the one used here, the equations are:

$$z^*(\mathbf{s}_o) = \sum_{i=1}^{n} \lambda_i \cdot z(\mathbf{s}_i) \tag{19.1}$$

$$\sigma^2(\mathbf{s}_o) = \left(\sum_{i=1}^{n} \lambda_i \cdot \gamma(\mathbf{s}_o, \mathbf{s}_i)\right) - \mu \tag{19.2}$$

where:

*n* ≤ *N* is the number of observations in the subset of the sample closest to **s**<sub>*o*</sub>;

*γ*(**d**) is the semivariogram, a function of the distance **d** between two locations;

*λ*<sub>*i*</sub> are the kriging weights; and *μ* is a Lagrange multiplier.


The method presumes knowledge of the function characterizing the spatial correlation between any two points, which is never the case. A structural analysis must be conducted before running kriging to estimate this function: a covariance or semivariogram. The semivariogram can be regarded as a scaled distance function. The weights and the Lagrange multiplier depend on the semivariogram for multiple drill-hole to drill-hole distances and estimation location to drill-hole distances. For details, see for example Olea (1999).

The two terms, *z*\*(**s**<sub>*o*</sub>) and *σ*<sup>2</sup>(**s**<sub>*o*</sub>), are the mean and the variance of the random variable modeling the uncertainty of the true value of the attribute *z*(**s**<sub>*o*</sub>), terms that are compatible with all that is known about the attribute through the sample of size *N*. Variance is a measure of dispersion, in this case, dispersion of possible values around the estimate, which is the most likely value. Hence, a sensitivity analysis of kriging variance to changes in the sample is a sensitivity analysis of variations in uncertainty due to changes in the sampling scheme. From Eq. 19.2, the kriging variance does not depend directly on the observations. The dependence is only indirect through the semivariogram, which is based on the data. Considering that there is one true semivariogram per attribute, changes in an adequate sample should not result in significant changes in the estimated semivariogram, which is therefore kept constant. This independence between data and standard error facilitates the application of kriging to the sensitivity analysis in the reliability of an assessment due to changes in sampling strategy because mathematically actual measurements are not necessary to calculate standard errors; the modeler only has to specify the semivariogram and the sampling locations.

Figure 19.3 shows the set of estimated semivariogram values obtained using the sample in Fig. 19.1 plus a model fitting the points for the purpose of having valid semivariogram values for any distance. In this case, the fitted curve is called a spherical model with a nugget of 20 sq ft, sill of 595 sq ft and a spatial correlation range of 88,920 ft. Geologically, the nugget is related to the variance of short scale fluctuations; the sill is of the same order of magnitude as the sample variance, and the correlation range is equal to half the average geographical size of the anomalies. For details on structural analysis, see for example Olea (2006).
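For reference, the spherical model has the standard form written out here. The symbols *C*<sub>0</sub> (nugget), *C* (partial sill) and *a* (range) are introduced for illustration, and it is assumed that the quoted sill of 595 sq ft includes the nugget, so that *C*<sub>0</sub> = 20 sq ft, *C*<sub>0</sub> + *C* = 595 sq ft and *a* = 88,920 ft:

$$\gamma(h) = \begin{cases} 0, & h = 0 \\ C_0 + C\left[\dfrac{3}{2}\,\dfrac{h}{a} - \dfrac{1}{2}\left(\dfrac{h}{a}\right)^{3}\right], & 0 < h \le a \\ C_0 + C, & h > a \end{cases}$$

If the quoted sill were instead the partial sill, *C* would be 595 sq ft and the total sill 615 sq ft.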

Figure 19.4 shows the results of applying ordinary kriging to the sample in Fig. 19.1a and Table 19.1 in the Appendix. As expected, the standard error is zero at the drill holes because there is no uncertainty where measurements have been taken.
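The calculation of the ordinary kriging standard error from drill-hole locations alone can be sketched as follows. The code solves the ordinary kriging system in the sign convention that leads to Eq. 19.2, using the spherical model quoted in the text with the sill assumed to include the nugget. The four drill-hole locations, the use of all holes in place of a search neighborhood and the function names are assumptions for illustration.

```python
import numpy as np

NUGGET, SILL, RANGE = 20.0, 595.0, 88_920.0   # sq ft, sq ft, ft (fitted model in the text)

def spherical(h):
    """Spherical semivariogram; SILL is assumed to include the nugget."""
    h = np.asarray(h, float)
    r = np.minimum(h / RANGE, 1.0)
    gamma = NUGGET + (SILL - NUGGET) * (1.5 * r - 0.5 * r**3)
    return np.where(h > 0.0, gamma, 0.0)       # gamma(0) = 0 by definition

def ok_standard_error(holes, target):
    """Ordinary kriging standard error at target from drill-hole coordinates only."""
    holes = np.asarray(holes, float)
    n = len(holes)
    d_hh = np.linalg.norm(holes[:, None, :] - holes[None, :, :], axis=2)
    d_ho = np.linalg.norm(holes - np.asarray(target, float), axis=1)
    # System: sum_j lambda_j * gamma_ij - mu = gamma_io, with sum_j lambda_j = 1.
    lhs = np.zeros((n + 1, n + 1))
    lhs[:n, :n] = spherical(d_hh)
    lhs[:n, n] = -1.0
    lhs[n, :n] = 1.0
    rhs = np.append(spherical(d_ho), 1.0)
    solution = np.linalg.solve(lhs, rhs)
    weights, mu = solution[:n], solution[n]
    variance = weights @ spherical(d_ho) - mu   # Eq. 19.2
    return float(np.sqrt(max(variance, 0.0)))

# Hypothetical square of four drill holes, 10,000 ft apart.
holes = [(0, 0), (10_000, 0), (0, 10_000), (10_000, 10_000)]
se_at_hole = ok_standard_error(holes, (0, 0))             # at a drill hole
se_inside = ok_standard_error(holes, (5_000, 5_000))      # interpolation
se_outside = ok_standard_error(holes, (60_000, 5_000))    # extrapolation
print(se_at_hole, se_inside, se_outside)
```

The function takes no thickness values as input, which is the property that makes kriging convenient for comparing sampling configurations before any drilling is done.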

Although kriging can analyze any configuration, Fig. 19.5 only relates to additions or eliminations to the basic sample in Fig. 19.1a. Values along the diagonal were used only for modeling the semivariogram and producing Fig. 19.4. Figure 19.5a also has every other row and column eliminated. Estimates could be produced for the first two configurations because thickness is known at each drill hole. The other maps were produced by interpolating locations in the sample with the next largest spacing; it is only possible to produce the standard error map for Fig. 19.5c–e.

**Fig. 19.3** Semivariogram for the Anderson coal bed thickness. The crosses denote estimated values and the curve is a model fitting the values

**Fig. 19.4** Ordinary kriging maps for the Anderson coal bed in a central part of the Gillette coal field (Wyoming) using the sample in Fig. 19.1: **a** thickness; **b** standard error

**Fig. 19.5** Ordinary kriging standard error for the same configuration in Fig. 19.2 for several average spacings: **a** 6 mi; **b** 3 mi; **c** 1.5 mi; **d** 3/4 mi; **e** 3/8 mi

The similarity between Figs. 19.2 and 19.5b may lead to incorrect conclusions. Although the location and extension of similar colors are approximately the same, what is important is the meaning of the colors. Figure 19.2 does not provide any numerical information that can be associated with the accuracy and the precision of the estimated values. In Fig. 19.5b the numbers are standard errors, a direct measurement of estimation reliability. In other more irregular configurations, there will not be similarity in color patterns no matter how the colors are selected. For example, by expanding the boundary of the study area, Fig. 19.6 shows how the Circular 891 classification is totally insensitive to the fact that, along the periphery, there is an increase in uncertainty because the data are now to one side, not surrounding the estimation locations. Instead, kriging accounts for the fact that extrapolation is always a more uncertain operation than interpolation, an important capability when accounting for boundary effects.

Kriging is able to provide random variables for the statistical characterization of uncertainty if the modeler is willing to introduce a distributional assumption. $z^*(\mathbf{s}_o)$ and $\sigma^2(\mathbf{s}_o)$ are the mean and the variance of the distribution of the random variable providing the likely values for $z(\mathbf{s}_o)$. These parameters are necessary but not sufficient to fully characterize any distribution. However, this indetermination can be eliminated by assuming a distribution that is fully determined with these two parameters. Ordinarily, the distribution of choice is the normal distribution, followed by the lognormal. The form of the distribution does not change by subtracting $z(\mathbf{s}_o)$ from all estimates. As the difference $z^*(\mathbf{s}_o) - z(\mathbf{s}_o)$ is the estimation error, the distributional assumption also allows characterizing the distribution for the error at $\mathbf{s}_o$.
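Under the normality assumption just described, the kriging estimate and its standard error are enough to compute any percentile or probability interval at a location. The following Python sketch illustrates the idea; the function name and the numerical values are hypothetical and are not taken from the chapter's maps.

```python
from statistics import NormalDist


def normal_uncertainty(estimate, std_error, prob=0.95):
    """Summarize uncertainty at one location, assuming a normal distribution.

    estimate  -- kriging estimate z*(s_o)
    std_error -- kriging standard error sigma(s_o)
    prob      -- cumulative probability of the percentile to report
    """
    dist = NormalDist(mu=estimate, sigma=std_error)
    percentile = dist.inv_cdf(prob)
    # Central interval that contains `prob` of the probability mass.
    tail = (1.0 - prob) / 2.0
    interval = (dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail))
    return percentile, interval


# Hypothetical estimate and standard error, in feet, for illustration only.
p95, central_95 = normal_uncertainty(estimate=60.0, std_error=10.0)
print(f"95th percentile: {p95:.2f} ft")
print(f"central 95% interval: {central_95[0]:.2f} to {central_95[1]:.2f} ft")
```

If a lognormal distribution were assumed instead, the same two parameters would have to be converted to the parameters of the logarithms before computing percentiles; that case is not covered by this sketch.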

**Fig. 19.6** Comparison of results when expanding the boundaries of the study area: **a** Circular 891 classification; **b** ordinary kriging standard error

Kriging with a distribution for the errors overcomes all the disadvantages of the distance methods listed in the previous section:


Figure 19.7 summarizes the results of the maps in Fig. 19.5. Display of the 95th percentile is based on the assumption that all random variables follow normal distributions. The curves clearly outline the consequences of varying the spacing in a square sampling pattern from 2,000 to 32,000 ft. So, for example, if it is required that all estimates in the study area must have a standard error less than 10 ft, then the maximum spacing must be at most 12,500 ft. The validity of the results, however, is specific to the attribute and sampling pattern: thickness of the Anderson coal bed investigated with a square grid. Any change in these specifications requires preparation of another set of curves.

### **19.5 Stochastic Simulation**

Despite limited acceptance, the kriging variance has been in use for a while in the sensitivity analysis of uncertainty to changes in sampling distances and configurations (e.g., Olea 1984; Cressie et al. 1990). Kriging, like any mathematical method, has been open to improvements. One result has been the formulation of another family of methods: stochastic simulation.

Relative to the topic of this chapter, stochastic simulation offers two improvements: (a) it is no longer necessary to assume the form for the distribution providing all possible values for the true value of the attribute $z(\mathbf{s}_o)$; and (b) the standard error is sensitive to the data.

As seen in Fig. 19.4, for every attribute and sample, kriging produces two maps, a map of the estimate and a map of the standard error. The idea of stochastic simulation is to characterize uncertainty by producing instead multiple attribute maps, all compatible with the data at hand and each representing one possible outcome of reality—realization, for short. From among the many available methods of geostatistical simulation, sequential Gaussian simulation has been chosen for this study because of its simplicity, versatility and efficiency (Pyrcz and Deutsch 2014). Figure 19.8 shows four simulated realizations, each of which is a possible reality in the sense that the values have the same statistics and spatial statistics (semivariogram) and the simulation reproduces the known sample values (i.e., the sample used to prepare Fig. 19.5b).

Generation of significant results needs preparation of more realizations than the four in Fig. 19.8. An estimation of uncertainty requires summarizing the fluctuations from realization to realization, either at local or global scales. Figure 19.9 is an example of local fluctuation summarizing all values of thickness at the same location for 100 realizations. This histogram is the numerical characterization of uncertainty through a random variable. There is one random variable for each of the 57,528 pixels (cells) comprising each realization. As clearly implied by the selected values in the tabulation, this collection of 100 maps provides multiple predictions of the true thickness value that should be expected at this location. For example, the expected value (mean) is 65.75 ft; the standard error is 13.47 ft; and there is a 0.95 probability that the coal bed will be less than 87.8 ft thick.
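The local summary described above amounts to collecting the simulated values at one cell across all realizations and computing statistics from them. The Python sketch below shows one way to do this. The 100 values are random numbers generated only for illustration, not the Anderson coal bed realizations, and the percentile rule used is one of several common conventions.

```python
import math
import random
import statistics


def summarize_location(values, prob=0.95):
    """Mean, standard deviation and an empirical percentile of simulated values.

    values -- simulated attribute values at one cell, one per realization
    prob   -- cumulative probability of the percentile to report
    """
    ordered = sorted(values)
    # Smallest simulated value with at least `prob` of the values at or below it.
    index = max(0, math.ceil(prob * len(ordered)) - 1)
    return statistics.mean(ordered), statistics.stdev(ordered), ordered[index]


# Hypothetical stand-in for 100 realizations at a single cell (thickness in ft).
random.seed(0)
values = [random.gauss(65.0, 13.0) for _ in range(100)]

mean, sd, p95 = summarize_location(values)
print(f"mean = {mean:.2f} ft, standard error = {sd:.2f} ft, 95th percentile = {p95:.2f} ft")
```

Applying the same function to every cell yields maps of the mean, the standard error, or any percentile.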

Maps can be generated for various statistics across the study area to display fluctuations in their values. Figure 19.10 shows a map of the mean and a map of the standard error. Note that the map for the mean is quite similar to the ordinary kriging map in Fig. 19.4a. More importantly, the maps for the standard errors in Figs. 19.5b and 19.10b are significantly different. The differences in the standard errors are primarily the result of the dependency of the standard error not only on the semivariogram and the drill hole locations, but also on the values of thickness. For example, comparing Figs. 19.1a and 19.10b, despite the regularity in the drilling, there is less uncertainty in the southwest corner where all values are low as well as in the south central part where all values are consistently high.

**Fig. 19.8** A sample of four sequential Gaussian realizations using the same data used in the preparation of Fig. 19.5b

Production of a display of the standard error equivalent to that in Fig. 19.5 is more challenging now that the standard deviation must be extracted from multiple realizations and the preparation of each realization requires a value at each drill hole in the configuration of interest to complete the analysis. Figure 19.11 shows the equivalent results to Fig. 19.5 for the same drill holes, but now produced after applying sequential Gaussian simulation. The additional data necessary to prepare the maps in Fig. 19.11c–e were obtained by randomly selecting 10 of the 100 realizations used to prepare the maps in Figs. 19.8 and 19.10. The data for the hypothetical drill holes were taken from the values at the collocated nodes in these selected 10 realizations, thus obtaining 10 datasets consisting partly of the 48 actual data in Fig. 19.11b plus the artificial data obtained by "drilling" the realizations.

**Fig. 19.9** Example of the numerical approximation to the random variable modeling uncertainty in the value of thickness at a site not yet drilled

**Fig. 19.10** Anderson coal bed thickness according to 100 sequential Gaussian simulations: **a** expected value of thickness; **b** standard error

**Fig. 19.11** Sequential Gaussian simulation standard error for the same configuration in Fig. 19.2 for several average spacings: **a** 6 mi; **b** 3 mi; **c** 1.5 mi; **d** 3/4 mi; **e** 3/8 mi

Finally, each dataset was used to generate 100 realizations, for a total of 1,000 realizations per configuration. As mentioned for Fig. 19.10b, despite the regularity of the drill hole pattern, the fluctuations in standard error are no longer completely determined by the drilling pattern.

Figure 19.12 is the summary equivalent to that in Fig. 19.7. Considering the completely different methodologies behind both sets of curves, the results are quite similar, particularly the curves for the mean standard error, which are almost identical. The more extreme standard errors of the sequential Gaussian simulation are larger than those for ordinary kriging in the case of the 95th percentile and the maximum value. The remaining question is: Which approach produces the most realistic forecasts of uncertainty?

### **19.6 Validation**

Figure 19.13 provides an answer to the question above in terms of percentiles. A percentile is a number that separates a set of values into two groups, one below and the other one above the percentile. The percentage of values below gives the name to the percentile. For example, in Fig. 19.9, the value 46.22 ft separates the 100 values of thickness into two classes, those below and those equal to or above 46.22 ft. It turns out that only 5 of the 100 values are below 46.22 ft. Hence, 46.22 ft is the 5th percentile of that dataset. Accepting only integer values of percentages, there are 99 percentiles in any dataset. The quality of a model of uncertainty can be validated by checking the proportion of true values that are actually below the percentiles of the prediction random variables collocated with data not used in the modeling. One of the reasons for selecting the Anderson coal thickness for the study is that there are many more data than the 48 values used to generate the realizations, a generous set of 2,136 additional values to be precise. This larger number of values has been used for checking the accuracy of the percentiles, not only the 5th percentile, but all 99 percentiles. In the graphs, the actual percentage shows, on average, the proportion of times the true value was below the percentile of a random variable at the location of a censored measurement. For example, in Fig. 19.13a, 641 times out of 2,136 (i.e., 30%) the true value was indeed below the 35th percentile. Ideally, all dots should lie along the main diagonal. The clear winner is sequential Gaussian simulation.

**Fig. 19.13** Validation of the uncertainty predictions made for the 3 mi spacing samples: **a** ordinary kriging; **b** sequential Gaussian simulation
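The validation check described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used to produce Fig. 19.13: the predictive distributions are taken to be normal, the function name is hypothetical, and the data form a toy set built to be perfectly calibrated, so every actual percentage equals its nominal percentile.

```python
from statistics import NormalDist


def actual_percentages(true_values, predictive_dists, percentiles=range(1, 100)):
    """For each nominal percentile, return the percentage of held-out true
    values that fall below that percentile of their predictive distribution."""
    n = len(true_values)
    results = {}
    for p in percentiles:
        below = sum(
            1
            for z, dist in zip(true_values, predictive_dists)
            if z < dist.inv_cdf(p / 100.0)
        )
        results[p] = 100.0 * below / n
    return results


# Toy data: 200 held-out "true" values placed at evenly spaced quantiles of a
# single hypothetical predictive distribution (mean 60 ft, standard error 10 ft).
reference = NormalDist(60.0, 10.0)
dists = [reference] * 200
truths = [reference.inv_cdf((k + 0.5) / 200) for k in range(200)]

coverage = actual_percentages(truths, dists)
print(coverage[5], coverage[35], coverage[95])
```

Plotting the actual percentages against the nominal percentiles gives the kind of graph described in the text; departures from the main diagonal indicate a miscalibrated model of uncertainty.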

### **19.7 Conclusions**

Distance methods, kriging and stochastic simulation rank, in that order, in terms of increasing detail and precision of the information that they are able to provide concerning the uncertainty associated with any spatial resource assessment.

The resource classification provided by distance methods is completely independent of the geology of the deposit and the method applied to calculate the mineral resources. The magnitude of the resource per class has no associated quantitative measure of the deviation that could be expected between the calculated resource and the actual amount in place.

The geostatistical methods of kriging and stochastic simulation base the modeling on the concept of random variable used in statistics, which allows the same type of probabilistic forecasting used in other forms of risk assessments. Censored data were used for validating the accuracy of the probabilistic predictions that can be made using the geostatistical methods. The results were entirely satisfactory, particularly in the case of stochastic simulation.

**Acknowledgements** This contribution completed a required review and approval process by the U.S. Geological Survey (USGS) described in Fundamental Science Practices (http://pubs.usgs.gov/circ/1367/) before final inclusion in this volume. I wish to thank Brian Shaffer and James Luppens (USGS), Peter Dowd (University of Adelaide) and Josep Antoni Martín-Fernández (Visiting Fulbright Scholar, USGS) for suggestions leading to improvements to earlier versions of the manuscript.

### **Appendix**

See Table 19.1.


**Table 19.1** Thickness data. ID = identification number; Thick. = thickness; ft = feet

### **References**



### **Chapter 20 Predicting Molybdenum Deposit Growth**

**John H. Schuenemeyer, Lawrence J. Drew and James D. Bliss**

**Abstract** In the study of molybdenum deposits and most other mineral deposits, including copper, lead and zinc, there is speculation that most undiscovered ore results from an increase (or "growth") in the estimated size of a known deposit due to factors such as exploitation and advances in mining and exploration technology, rather than from the discovery of wholly new deposits. The purpose of this study is to construct a nonlinear model to estimate deposit "growth" for known deposits as a function of cutoff grade. The model selected for this data set was a truncated normal cumulative distribution function. Because the cutoff grade is commonly unknown, a model to estimate cutoff grade conditioned upon the deposit grade was constructed using data from 34 deposits with reported data on molybdenum grade, cutoff grade, and tonnage. Finally, an example is presented.

**Keywords** Porphyry molybdenum ⋅ Deposit growth ⋅ Cutoff grade ⋅ Truncated cumulative distribution model fitting and estimation ⋅ Confidence and prediction intervals for nonlinear estimation

### **20.1 Introduction**

Initial estimates of a mineral deposit size based on limited data usually underestimate the ultimate size of a mineral deposit, often by a significant amount. The initial size estimate may be of only marginal interest but the size estimate after some exploration and development can be of significant interest. The steps in this process are the subject of this chapter. "Mineral resources" are defined as concentrations or occurrences of material of economic interest in or on the Earth's crust in such form, quality, and quantity that there are reasonable prospects for eventual economic extraction (Zientek and Hammarstrom 2014), and the term "mineral reserves" is restricted to the economically mineable part of a mineral resource.

J. H. Schuenemeyer (✉) ⋅ L. J. Drew ⋅ J. D. Bliss

Southwest Statistical Consulting, LLC, Cortez, CO, USA; e-mail: jackswsc@q.com

© The Author(s) 2018. B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_20

The reported size of known mineral or oil and gas deposit reserves recorded in the mining literature typically increases through time as subsequent development drilling and mining enlarge the deposit's footprint. This phenomenon is referred to as "deposit growth". In a sense, a deposit is never finished "growing" until it is completely mined out. Research on the growth of a deposit's reserves has been a topic of investigation for many years within the United States Geological Survey. Drew (1997) illustrated the growth of oil and gas fields over time in the United States and determined that a large percentage of the ultimate production of a region could come from deposit growth, if the forecast was made early enough in the discovery process. Long (2008) defined reserve growth as the ratio of current reserves plus past production to original reserves. He examined reserve growth in porphyry copper deposits and found that about 20% of porphyry copper mines in the Western Hemisphere had experienced reserve growth of a factor of 10 or better over initial reserves. Reserve growth at these mines added reserves comparable in size to reserves added through discovery of new deposits during the same time period.
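Long's (2008) definition of reserve growth quoted above is a simple ratio. The short Python sketch below shows it; the function name and the tonnages are hypothetical and serve only to make the arithmetic concrete.

```python
def reserve_growth(current_reserves, past_production, original_reserves):
    """Reserve growth: (current reserves + past production) / original reserves.

    All three quantities must be in the same units (for example, metric tons).
    """
    if original_reserves <= 0:
        raise ValueError("original reserves must be positive")
    return (current_reserves + past_production) / original_reserves


# Hypothetical deposit: 300 Mt of current reserves and 200 Mt already mined,
# against an original reserve estimate of 50 Mt.
growth_factor = reserve_growth(300.0, 200.0, 50.0)
print(growth_factor)  # 10.0
```

A factor of 10, as in this made-up case, corresponds to the level of growth that the text reports for about 20% of porphyry copper mines in the Western Hemisphere.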

Three variables are required to estimate the ultimate size of a deposit: (1) the grade of the deposit, (2) cutoff grade of the deposit, and (3) associated tonnage of ore at successive points in the development of the deposit (Long 2008). The grade of a deposit is defined as the relative quantity of ore mineral within the orebody, typically expressed as a percentage (or g/t). The grade may vary across an orebody, but commonly an average grade may be applied to the orebody as a whole. A cutoff grade is the lowest grade of mineralized material that qualifies as economically mineable and available in a given deposit (Committee for Mineral Reserves International Reporting Standards 2006). Mined material with a grade below the cutoff grade is not processed into metal but is set aside. As deposit development and mining progress, over time the cutoff grade usually declines in an orderly manner. Tonnage is typically reported in metric tons (mt) and includes the mass of total production, reserves and resources of pre-mined material.

The purpose of this study was to construct a nonlinear model to estimate the incremental deposit "growth" for known mineralized areas as a function of cutoff grade, using porphyry molybdenum deposits as an example. Porphyry molybdenum deposits are related to granitic plutons, mostly of Tertiary age, and are formed by hydrothermal fluids associated with the emplacement of granites. They typically occur as large tonnage, low-grade deposits that are commonly mined using open-pit methods.

Two issues must be addressed to predict porphyry molybdenum deposit growth. The first is that, in many instances, the cutoff grade is not available for a given deposit and must be estimated; the first part of this study therefore uses the known molybdenum grade of a deposit to predict the probable cutoff grade. The second part of this study in turn uses this predicted cutoff grade to estimate deposit growth as a function of cutoff grade. Two data sets were used in this study. Nearly all porphyry molybdenum deposits used in this study are unworked deposits; that is, deposits that have been delineated by drilling but are as yet unmined. The first data set (Appendix 1) consists of 34 porphyry molybdenum deposits used to model molybdenum cutoff grade in percent (COG) as a function of molybdenum deposit grade, also expressed in percent. The second data set (Appendix 2) is used to model the deposit growth as a function of cutoff grade. The references to Appendices 1 and 2 are Barnes et al. (2009), Baudry (2009), Becker et al. (2009), British Columbia Ministry of Energy and Mines (2012, 2014a, b), Chen and Wang (2011), Ewert et al. (2008), General Moly (2012), Geological Survey of Finland (2011), Geoscience Australia (2012), Kramer (2006), Lowe et al. (2001), Ludington and Plumlee (2009), Mercator Minerals (2011), Mindat.org (1992, 2011), Nanika Resources Inc (2012), Northern Miner (2010), Raw Minerals Group (2011), RX Exploration Inc (2010), Singer et al. (2008), Smith (2009), Taylor et al. (2012), Thompson Creek Metals Company Inc (2011), TTM Resources Inc (2009), US Geological Survey (2011), Wu et al. (2011), Yukon Geological Survey (2005). The authors know of no subset of publications that cite the deposits presented in Appendices 1 and 2.

### **20.2 Cutoff Grade as a Function of Deposit Grade**

The first and most straightforward of the two models to analyze expresses molybdenum cutoff grade (Mo COG, %) as a function of molybdenum deposit grade (Mo Grade, %) for the 34 deposits shown in Appendix 1. A scatter plot of these two variables plus a fitted linear regression line, 95% confidence intervals, and 95% prediction intervals are shown in Fig. 20.1.

**Fig. 20.1** Cutoff grade (COG, %) versus deposit grade (%) plus a fitted linear model and the 95% confidence intervals (dashed lines) and corresponding prediction intervals (dotted lines) for the 34 deposits (Appendix 1)

**Fig. 20.2** Residuals versus deposit grade for the linear model fit (Fig. 20.1)

The model to fit cutoff grade *U* as a function of deposit grade *D* is

$$U = \begin{cases} 0 & 0 \le D < c \\ \beta_0 + \beta_1 D + \varepsilon & D \ge c \end{cases}$$

where $\varepsilon$ is the random error, assumed to be normal, $N(0, \sigma^2)$. Because the COG cannot be negative (COG ≥ 0), the constant $c$ is determined from the linear regression fit as the deposit grade at which the fitted line crosses zero, $c = -\beta_0/\beta_1$.

The fitted model is:

$$\hat{U} = \begin{cases} 0 & 0 \le D < c = 0.0159 \\ \beta_0 + \beta_1 D = -0.01042 + 0.6553D & D \ge 0.0159 \end{cases}$$

where $\hat{U}$ is the estimated cutoff grade in percent and *D* is the deposit grade in percent. The residual standard error is 0.012 on 32 degrees of freedom and the adjusted $R^2$ = 0.61. The model is statistically significant and reasonable for the given data set. The residual plot is shown in Fig. 20.2.
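The fitted piecewise model is easy to evaluate directly. The following Python sketch uses the coefficients reported above; the function name is hypothetical, and the threshold is recomputed from the coefficients rather than hard-coded.

```python
def predict_cutoff_grade(deposit_grade, b0=-0.01042, b1=0.6553):
    """Estimated cutoff grade (%) from deposit grade (%), floored at zero.

    b0, b1 -- intercept and slope of the fitted linear model reported in the text
    """
    if deposit_grade < 0:
        raise ValueError("deposit grade must be non-negative")
    c = -b0 / b1  # deposit grade at which the fitted line crosses zero
    if deposit_grade < c:
        return 0.0
    return b0 + b1 * deposit_grade


threshold = 0.01042 / 0.6553
print(round(threshold, 4))                   # 0.0159
print(round(predict_cutoff_grade(0.10), 5))  # 0.05511
print(predict_cutoff_grade(0.01))            # 0.0
```

This returns only the point estimate. As the text notes for deposit grades outside the observed range of 0.03 to 0.13, such predictions depend on the same linear relationship holding.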

There is no evidence to suggest that the residuals are non-normal. Thus, within the domain of the deposit grade, namely from 0.03 to 0.13, the linear model shown above appears to be appropriate. Predictions outside of this interval will depend on the same linear relationship holding.

### **20.3 Deposit Growth as a Function of Cutoff Grade**

The second model is the fraction of growth as a function of estimated cutoff grade. In this example the growth data (Fig. 20.3) consist of 58 observations from eight deposits (Appendix 2). The inverse S-shaped form of the data corresponds to an inverse cumulative distribution function. Therefore, this relationship is modeled as an inverse cumulative distribution function, since the fraction growth is a number between 0 and 1, inclusive. Several models, including the gamma, lognormal, normal and their left truncated forms, were candidates to fit these data. Of these, the left truncated normal was the best fit by visual inspection and by a nonlinear least squares fit. The form of the left truncated normal probability density function is:

$$f_L(x|\Theta) = \frac{f(x|\Theta)}{1 - F(\lambda|\Theta)}, \quad x > \lambda$$

where $\Theta' = (\mu, \sigma^2)$ and the left truncation point $\lambda$ is assumed known. The probability density function for the normal distribution with mean $\mu$ and standard deviation $\sigma$ is:

$$f(x|\mu, \sigma^2) = \frac{e^{-(x-\mu)^2/(2\sigma^2)}}{\sqrt{2\pi}\,\sigma}$$

The corresponding left truncated cumulative distribution function, cdf, is:

$$F_L(x|\Theta) = \frac{F(x|\Theta) - F(\lambda|\Theta)}{1 - F(\lambda|\Theta)}, \quad x > \lambda$$

The truncated distribution models used for model fitting are from the package truncdist (r-project.org) by Novomestky and Nadarajah (2012), based upon work by Nadarajah and Kotz (2006).
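The two formulas above can be written out in a few lines. The sketch below is in Python with the standard library rather than the truncdist package used by the authors. The function names and parameter values are hypothetical, and the numerical integration at the end is only a consistency check that the density integrates to the cumulative distribution function.

```python
from statistics import NormalDist


def truncated_normal_pdf(x, mu, sigma, lam=0.0):
    """Density of a normal(mu, sigma) distribution left truncated at lam."""
    if x <= lam:
        return 0.0
    base = NormalDist(mu, sigma)
    return base.pdf(x) / (1.0 - base.cdf(lam))


def truncated_normal_cdf(x, mu, sigma, lam=0.0):
    """Cumulative distribution of a normal(mu, sigma) left truncated at lam."""
    if x <= lam:
        return 0.0
    base = NormalDist(mu, sigma)
    f_lam = base.cdf(lam)
    return (base.cdf(x) - f_lam) / (1.0 - f_lam)


# Hypothetical parameters, chosen only to exercise the functions.
mu, sigma, lam = 0.05, 0.03, 0.0

# Midpoint-rule integral of the density from lam to 0.08.
steps = 2000
width = (0.08 - lam) / steps
integral = sum(
    truncated_normal_pdf(lam + (k + 0.5) * width, mu, sigma, lam) * width
    for k in range(steps)
)
print(integral, truncated_normal_cdf(0.08, mu, sigma, lam))
```

The two printed numbers should agree to several decimal places, and the cumulative distribution function approaches 1 for large arguments.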

As Fig. 20.1 shows, there is uncertainty in the COG when estimated from the deposit grade. However, when estimating the left truncated normal cumulative distribution function (cdf), the estimates are conditioned upon the COG being known. A possible alternative is an errors-in-variables approach (Schennach 2004) where both the fraction growth and cutoff grade are considered to be random variables.

The chosen optimization criterion to estimate the fraction growth (Fig. 20.3) is

$$\min \left( \sum_{i=1}^{n} \left( F(x_i|\Theta) - \hat{F}(x_i) \right)^2 \right),$$

where $x_i$ is the *i*th COG and *F* is the cumulative distribution function. $\Theta$ contains the estimated parameters. If *F* is a normal distribution the parameters would be $\hat{\mu}$ and $\hat{\sigma}$.

**Fig. 20.3** Deposit fraction growth plotted against cutoff grade (COG) in percent for the 8 deposits used in this study

The *i*th COG is represented by $x_i$ and the corresponding empirical cumulative value by $\hat{F}(x_i)$. Note that $\hat{F}(x_i) = 1 - \hat{G}(x_i)$, where $\hat{G}(x_i)$ is the fraction growth. The nonlinear least squares package used to estimate the left truncated normal model parameters is nls2 (r-project.org); see Grothendieck (2013). The left truncation point is $\lambda = 0$.
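The least squares criterion above can be illustrated with a small brute-force search. The Python sketch below is a stand-in for illustration, not the nls2 fit used by the authors. It minimizes the residual sum of squares over a coarse grid of candidate parameters, and the data are synthetic and noise-free, generated from known parameters so that the search can be checked. All names and numerical values are assumptions.

```python
from statistics import NormalDist


def truncated_normal_cdf(x, mu, sigma, lam=0.0):
    """Cumulative distribution of a normal(mu, sigma) left truncated at lam."""
    if x <= lam:
        return 0.0
    base = NormalDist(mu, sigma)
    f_lam = base.cdf(lam)
    return (base.cdf(x) - f_lam) / (1.0 - f_lam)


def rss(params, xs, f_hat):
    """Residual sum of squares between model and empirical cdf values."""
    mu, sigma = params
    return sum(
        (truncated_normal_cdf(x, mu, sigma) - f) ** 2 for x, f in zip(xs, f_hat)
    )


def grid_fit(xs, f_hat, mus, sigmas):
    """Return the (mu, sigma) pair on the grid with the smallest RSS."""
    candidates = ((mu, sigma) for mu in mus for sigma in sigmas)
    return min(candidates, key=lambda p: rss(p, xs, f_hat))


# Synthetic cutoff grades and cdf values from known, hypothetical parameters.
true_mu, true_sigma = 0.06, 0.03
xs = [0.01 * k for k in range(1, 15)]
f_hat = [truncated_normal_cdf(x, true_mu, true_sigma) for x in xs]

mus = [0.040 + 0.005 * k for k in range(9)]     # 0.040 to 0.080
sigmas = [0.010 + 0.005 * k for k in range(9)]  # 0.010 to 0.050
best_mu, best_sigma = grid_fit(xs, f_hat, mus, sigmas)
print(best_mu, best_sigma)
```

With real data the grid result would normally serve only as a starting value for an iterative nonlinear least squares routine.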

Deposit growth as a function of cutoff grade was modeled for each of the eight deposits (not shown). These results indicate that the data could have been generated from the same population. Thus, the observations were pooled and a single model was fit. The reason to fit a cumulative distribution function was twofold. One was that eight deposits were used, so the data were not in the form of a stepwise function. The second was that the data were not randomly or systematically spaced across the domain of the empirical distribution. The data, expressed as an empirical distribution function, together with the cumulative left truncated normal distribution fit and confidence intervals, are shown in Fig. 20.4. The results of the least squares fit were $\hat{\mu}$ = 0.0609 and $\hat{\sigma}$ = 0.0282. The residual sum of squares is RSS = 0.3631.

The 95% confidence and prediction intervals for nonlinear estimation are approximate. The confidence interval shown in Fig. 20.4 (dashed lines) is from the function predictNLS in the R package propagate (r-project.org), programmed by Spiess (2014) based upon work by Bates and Watts (2007), and others. It uses a second-order Taylor series expansion and Monte Carlo simulation. The second-order approximation captures the nonlinearities around *f*(*x*). A corresponding algorithm for the prediction interval has not been developed. The prediction interval shown in Fig. 20.4 (dotted lines) is based upon a linear model of the form $H = \alpha_0 + \alpha_1 U + \varepsilon$, where *U* is the COG and *H* is a linear estimate of growth. The next step was to estimate the upper and lower prediction intervals for the linear model with *U* = 0, 0.001, 0.002, …, 0.150. These are vectors **LPIu** and **LPIl**, respectively. The upper and lower 95% nonlinear confidence interval vectors estimated above are **CIu** and **CIl**, respectively. The differences between the linear prediction intervals and the nonlinear confidence intervals are computed as follows. Let **Lud** = **LPIu** − **CIu** and **Lld** = **CIl** − **LPIl**. The estimated upper and lower prediction intervals, **UP** and **LP**, for the nonlinear fit (Fig. 20.4) are **UP** = **CIu** + **Lud** and **LP** = **CIl** − **Lld**. These estimates appear reasonable in the given domain, namely for COG between 0.04 and 0.10.

**Fig. 20.4** Data fit to a left truncated (at 0) normal distribution is the solid line. The approximate 95% confidence interval is the dashed line. The approximate 95% prediction interval is the dotted line

**Fig. 20.5** Histogram of residuals for fit to a left truncated normal distribution

A histogram of the residuals, which appear normal, is shown in Fig. 20.5. The truncated normal probability density function corresponding to the cumulative distribution function (Fig. 20.4) and COG data are shown in Fig. 20.6.

Figure 20.7 is like Fig. 20.4 except that the variable plotted on the vertical axis is the fraction growth as opposed to the cumulative distribution. There is no suggestion that the model illustrated in Fig. 20.7 is universal, even for molybdenum deposits. Clearly different deposits may require different models.

### **20.4 An Example**

Suppose the problem is to estimate the fraction growth corresponding to a COG (%) = 0.06 using the model shown in Fig. 20.7. Then, given that the assumed distribution is a normal truncated at zero with estimated model parameters $\hat{\mu}$ = 0.0609 and $\hat{\sigma}$ = 0.0282, the results are shown in Table 20.1. The point estimate of fraction growth, namely 0.479, is straightforward to compute. It is:

$$\hat{F}_L(x|\hat{\Theta}) = \frac{F(x|\hat{\Theta}) - F(\lambda|\hat{\Theta})}{1 - F(\lambda|\hat{\Theta})}, \quad x > 0, \quad \hat{\Theta}' = (\hat{\mu}, \hat{\sigma}^2)$$
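The arithmetic behind the point estimate can be checked with a few lines of Python. This sketch only evaluates the formula above with the fitted parameters reported in the text; it is not the authors' R code, the variable names are assumptions, and it does not attempt the confidence or prediction intervals.

```python
from statistics import NormalDist

# Fitted parameters and truncation point reported in the text.
mu_hat, sigma_hat = 0.0609, 0.0282
lam = 0.0
cog = 0.06  # cutoff grade of interest, in percent

base = NormalDist(mu_hat, sigma_hat)
f_lam = base.cdf(lam)  # probability mass removed by truncation at zero
point_estimate = (base.cdf(cog) - f_lam) / (1.0 - f_lam)

print(round(point_estimate, 3))  # 0.479
```

The truncation at zero removes only about 1.5% of the probability mass of the untruncated normal, so the result is close to, but not the same as, the untruncated value of about 0.487.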

**Fig. 20.6** The fitted truncated normal probability density function and COG data (the circles)

**Fig. 20.7** Fraction growth as a function of COG (%) and corresponding fitted values (solid line), 95% confidence interval (dashed line) and 95% prediction interval (dotted line)


**Table 20.1** Estimated fraction growth, 95% confidence and prediction intervals for COG (%) = 0.06

The confidence and prediction intervals are more difficult to compute; however, the R code is available on request from John Schuenemeyer.

### **20.5 Conclusions**

Mineral deposit growth commonly constitutes most unknown resources. The growth considered in this study is due to a progressively lower cutoff grade, which may be unknown. In this study, a statistical model was constructed to model cutoff grade as a function of deposit grade, followed by construction of a model to estimate the fraction growth as a function of cutoff grade. This latter model involves estimation of a truncated normal distribution and second order Taylor series estimates to characterize uncertainty.

**Acknowledgements** Data used in this chapter represent part of an extensive and ongoing data compilation effort on porphyry molybdenum deposit types. This study evolved over several years through discussions with current and former U.S. Geological Survey employees including Arthur A. Bookstrom, Mark D. Cocker, Robert J. Kamilli, Keith R. Long, Steve Ludington, Barry C. Moring, Greta J. Orris, Ryan D. Taylor, Jay A. Sampson, and Gregory T. Spanski. Eric Seedorf, University of Arizona, Tucson provided his bibliography on porphyry molybdenum deposits, which was of considerable use in this study.

### **Appendix 1**

Porphyry molybdenum data for 34 selected deposits used to model molybdenum cutoff grade as a function of deposit grade.

[Country and state codes: AUQL = Australia, Queensland; CHHN = China; CHNA = China; CNBC = Canada, British Columbia; CNNF = Canada, Newfoundland and Labrador; CNON = Canada, Ontario; CNYT = Canada, Yukon Territory; GRLD = Greenland; MCDA = Macedonia; MNGA = Mongolia; MXCO = Mexico; RUSA = Russia; USAK = USA, Alaska; USID = USA, Idaho; USMT = USA, Montana; USNV = USA, Nevada; USWA = USA, Washington]


### **Appendix 2**


Molybdenum data for estimating fraction deposit growth from cutoff grade; *n* = 58


### **References**



### **Chapter 21 General Framework of Quantitative Target Selections**

**Guocheng Pan**

**Abstract** Mineral target selection has been an important research subject for geoscientists around the world in the past three decades. Significant progress has been made in the development of mathematical techniques and estimation methodologies for mineral mapping and resource assessment. Integration of multiple data sets, either by experts or statistical methods, has become a common practice in the estimation of mineral potential. However, the real effect of these methodologies is at best very limited in terms of uses for government macro policy making, resource management, and mineral exploration in commercial sectors. Several major problems in data integration remain to be solved in order to achieve significant improvement in the effect of resource estimation. Geoscience map patterns are used for decision-making for mineral target selections. The optimal data integration methods proposed so far can be effectively applied by using GIS technologies. The output of these methods is a prognostic map that indicates where hidden ore bodies may occur. Issues related to randomness of mineral endowment, intrinsic statistical relations, exceptionalness of ore, intrinsic geological units, and economic translation and truncation are addressed in this chapter. Moreover, a number of specific important technical issues in information synthesis are also identified, including information enhancement, spatial continuity, data integration and target delineation. Finally, a new concept of dynamic control areas is proposed for future development of quantification of mineral resources.

### **21.1 Introduction**

Rather than elaborating new techniques, this chapter focuses on fundamental aspects of mineral resource assessment (Pan et al. 1992). Some of the critical issues are reconsidered here in light of a new understanding of the basic geo-relations between resource descriptors and geological processes. Various multivariate models and techniques have been used over the past two decades to relate geological variables to some aspects of mineral occurrences or deposits. Conventional objective methods for mineral resource assessment have estimated either the mineral endowment or the discoverable mineral resources of a particular type of deposit in a region. The mineral endowment of a region usually refers to that quantity of mineral in accumulations meeting specified physical characteristics, such as grade, size, and depth. A multivariate endowment model is essentially characterized by a particular information extraction strategy for the so-called optimum combination of those geological features most related to spatial variations of endowment (Pan and Harris 1991). Most of these models estimate mineral resources based upon the principle of analogy, i.e., the resources in a study region are estimated by a model that is established on a control area by assuming that different regions with similar geological environments have similar endowment (Pan and Harris 1991; Harris 1984; Harris and Pan 1991; Pan and Harris 2000; Agterberg 1981, 2014).

G. Pan (✉)
China Hanking Holdings, 227 Qingnian Avenue, Hanking Tower 22nd Floor, Shenyang 110016, Liaoning, People's Republic of China
e-mail: gpan100@yahoo.com; pangc@hanking.com

© The Author(s) 2018

B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_21

Most of these models have employed a grid of regularly spaced cells (inter-grid areas) as the information reference and have dealt in one way or another with mineral favorability, probability, mineral wealth, or density of mineral occurrences (deposits). Of special interest have been those models that describe uncertainty about these estimates, such as the probability for occurrence of mineral deposits within a cell. These studies seem to have been a necessary step in the evolution of the science of mineral resource prediction, because geologists in general have been slow to adopt quantitative methods, and even reluctant to substitute objective and quantitative analysis for all or part of subjective analysis. Thus, there was a need to demonstrate quantitative methods that could be used to estimate undiscovered mineral resources. However, to some extent, this reluctance reflected geologists' dissatisfaction with the at-best low, and sometimes trivial, level of geoscience information captured by the quantitative variables and related to mineral occurrence by the multivariate models. Simply stated, mineral resource estimates by quantitative and objective methods will not improve significantly until more geoscience information is related in more appropriate ways to the various descriptors of mineral resources.

Supplying worldwide demand for metallic raw materials throughout the rest of this century may require multiple times the amount of metal contained in known ore deposits (Patiño Douce 2016a, b). Sustainability of resource supply is therefore a key task for scientific mineral assessment. The concept of mineral resource is many-faceted: it includes the physical and chemical properties of mineral deposits as they occur naturally in the earth's crust, as well as the economic properties created by man's socio-technical production system and the demands for mineral materials derived therefrom. The discussion presented here focuses upon several aspects of mineral resources that are fundamental considerations in effective information synthesis for mineral resource estimation: randomness of mineral endowment, basic statistical relations, scarceness, geological foundations, economic truncation and translation, and spatial continuity. Some major issues in quantitative mineral resource estimation are also addressed, including information enhancement, information synthesis, and target identification. Information synthesis is a central task in both mineral exploration and resource estimation.

### **21.2 Randomness of Mineral Endowment**

Most past and current studies on mineral resource estimation have been constructed and applied on the basis of a common assumption that mineral endowment descriptors and at least some of the related geologic processes behave more or less according to certain stochastic rules. The assumption is seldom challenged, although controversies have continued for over four decades about, for example, the types of stochastic laws that govern the true distributions of geochemical element concentrations (Harris 1984; Vistelius 1960; Brinck 1972). This seems to indicate that the assumption that some geological processes are to some extent stochastic and follow certain stochastic laws has been widely accepted, although it is premature to assert that all geoscience features are stochastic. It is useful to examine this notion before investigating specific stochastic laws for particular geologic events, the use of statistical models to estimate mineral resources, and probabilistic descriptions of resource descriptors.

In his famous 'Ideal Granite Model', Vistelius (1972) showed that the crystallization of minerals contained in the 'ideal granite', such as potassium feldspar, quartz, and plagioclase, can be modeled by stochastic functions that vary in space and time. It has been proved mathematically that there is a three-dimensional 'packing of particles' such that the three mutually perpendicular directions can be described according to the Markov property in each direction, with identical transition probability matrices in the three directions (Vistelius and Harbaugh 1980). Another example due to Vistelius is his gravitational stratification package model (Vistelius 1981). In the study of red beds of the Cheleken Peninsula, under certain assumptions, Vistelius showed that the sequence of red beds with two distinct states, S (arenaceous beds) and A (argillaceous beds), can be treated as a homogeneous reversible Markov chain of second order, with the partial transitions through A being first-order Markov and the partial transitions through S being second-order Markov.

Sedimentary sequences have been regarded generally as some types of cyclic processes which are associated with certain Markov properties (Schwarzacher 1969; Hattori 1976; Pan 1987; Kantsel 1967; Pan and Porterfield 1995). Pan (1987) demonstrated that many sedimentary sections can be treated as homogeneous stochastic processes if no significant depositional discontinuities or structural unconformities occur in the sequences and that homogeneous sedimentary processes can be decomposed uniquely into the sum of independent reversible and unidirectional stochastic flows.
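To make the Markov description of bed sequences concrete, the following minimal sketch estimates a first-order transition probability matrix from a logged two-state section. The sequence `section` and the helper `transition_matrix` are hypothetical and introduced only for illustration; the sketch does not reproduce the second-order model of Vistelius or the flow decomposition of Pan (1987).

```python
from collections import Counter


def transition_matrix(sequence, states):
    """Estimate first-order Markov transition probabilities from a bed sequence."""
    # Count every pair of consecutive beds (from_state, to_state).
    counts = Counter(zip(sequence[:-1], sequence[1:]))
    matrix = {}
    for a in states:
        row_total = sum(counts[(a, b)] for b in states)
        matrix[a] = {
            b: (counts[(a, b)] / row_total if row_total else 0.0) for b in states
        }
    return matrix


# Hypothetical logged section: S = arenaceous bed, A = argillaceous bed.
section = "SSASAASSASAS"
P = transition_matrix(section, states="SA")

for state, row in P.items():
    print(state, {k: round(v, 2) for k, v in row.items()})
```

Each row of `P` sums to one, so it can be read directly as the estimated probability of the next bed type given the current one. Testing whether a higher-order chain fits better would require counting triples rather than pairs.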

The process of ore deposition was closely examined by Kantsel (1967) based upon the function of metal distribution in ores. The process of hydrothermal mineralization during a single stage can be treated as a continuous stationary process of the Markov type. The resulting concentration of metal can be represented by a distribution function whose most important characteristic reflects the speed of the mineralization process. Stochastic modeling methods and uncertainty quantification are important tools for gaining insight into the geological variability of subsurface structures and the formation of mineral deposits (Wang et al. 2017). Modeling of 3D geological processes helps reveal hidden information on the variability of controlling factors, which define the likelihood of occurrence of mineralization processes.

These contributions are informative about some fundamental and controversial issues regarding the application of stochastic models to mineral exploration, although some concerns cannot be satisfactorily resolved without more research. A partial conclusion drawn from these preliminary works should be that *at least under certain conditions some of the geologic or earth processes can be modeled by stochastic laws*. However, it would be incorrect to associate the earth processes with the stochastic laws through one-to-one relations, since the random properties of geologic events generally are space and time dependent.

### **21.3 Fundamental Geo-process Relations**

Observations on geologic features in certain spatial and temporal settings are the outcomes of a sequence of geologic processes superimposed during crustal evolution and initiated by the inner energies of the earth, the biosphere, hydrosphere, and atmosphere, as well as other universal forces. Conceptually, there should be two levels of cause–effect relations among geologic events, crustal evolution, and the initial forces that created the earth. Crustal evolution commonly represents the entirety of earth processes, e.g., crustal movement, magmatic intrusion, migration of ore-bearing fluids, and erosion, while geologic entities, such as lithologic phases, hydrothermal alterations, geologic structures, and ore deposits, are outcomes of these processes. Let *o*<sub>1</sub>, *o*<sub>2</sub>, …, *o<sub>k</sub>* denote the *k* initial forces, *f*<sub>1</sub>, *f*<sub>2</sub>, …, *f<sub>p</sub>* the *p* earth processes, and *z*<sub>1</sub>, *z*<sub>2</sub>, …, *z<sub>m</sub>* the *m* geological features, including resource descriptors. Then, the cause–effect relations may be conceptualized as follows:

$$f_j = g_j(o_1, o_2, \dots, o_k), \quad j = 1, 2, \dots, p, \tag{21.1a}$$

$$z_i = h_i(f_1, f_2, \dots, f_p), \quad i = 1, 2, \dots, m. \tag{21.1b}$$

The conceptual model (21.1a, 21.1b) implies that the original forces are the direct causes of crustal evolution, represented by a series of geologic processes, which in turn are the direct causes of the geologic features (outcomes). Since some of these geologic features are resource descriptors, such as number of deposits or quantity of endowment, relation (21.1a, 21.1b) states that a mineral deposit is the result of a sequence of superimposed geologic processes. The functions *g<sub>j</sub>* and *h<sub>i</sub>* may be assumed to be random, provided that the original causes or geologic processes are considered to be stochastic.

A relevant question in statistical estimation of resources concerns the basic statistical models useful for describing inherent relations between the geodata and resource descriptors, given that geoscience information is stochastic. One should keep in mind that the basic cause–effect relations (21.1a, 21.1b) do not imply any cause–effect relation between the resource descriptors and other geological features, although syngenetic or parallel relations do exist because both are outcomes of some common earth processes. For example, both argillic alteration and copper mineralization result from the same process of magmatic intrusion. Since current knowledge of the original causes is very limited, it is not realistic to discover the relations *g<sub>j</sub>* in (21.1a). Assuming that the random portions of the earth's processes can be isolated from the deterministic part, the following two sets of auxiliary relations should be essential:

$$r_l = \varphi_l(f_1, f_2, \dots, f_p) + \nu_l, \quad l = 1, 2, \dots, d, \tag{21.2a}$$

$$z_i = \psi_i(f_1, f_2, \dots, f_p) + e_i, \quad i = 1, 2, \dots, m, \tag{21.2b}$$

where the *r<sub>l</sub>* are the resource descriptors, the *z<sub>i</sub>* are other geologic features, and the *ν<sub>l</sub>* and *e<sub>i</sub>* are random errors. However, a further difficulty arises because our knowledge of earth processes is also limited. What one can observe in practice are only the geological features *z<sub>i</sub>* and perhaps some of the resource descriptors. Although there is no direct causal relation between the mineral resource descriptors and other geologic features, their syngenetic and concurrent relations ensure that the geologic features carry some indirect information about the resources. Hence, the geological processes, and thus the mineral resource descriptors, can be mathematically reconstructed through a reverse functional estimation:

$$f_j = \Omega_j(z_1, z_2, \dots, z_m) + \omega_j, \quad j = 1, 2, \dots, p, \tag{21.3a}$$

$$r_l = \Phi_l(f_1, f_2, \dots, f_p) + \varepsilon_l, \quad l = 1, 2, \dots, d, \tag{21.3b}$$

where *ω<sub>j</sub>* and *ε<sub>l</sub>* are the random error terms for the geological process and resource descriptor estimates, respectively.

Accordingly, if *m* is much greater than *d*, a feasible solution for mineral resource estimation may be completed in two steps:

1. Reconstruct the geological processes *f<sub>j</sub>* from the observable geological features *z<sub>i</sub>* through relation (21.3a).
2. Estimate the resource descriptors *r<sub>l</sub>* from the reconstructed geological processes through relation (21.3b).

The first step is exactly analogous to factor-type analysis, constructing significant geologic factors (causes) from observable geological features, whereas the second step is regression-type analysis, predicting the resource descriptors (effects) from the geological factors. Consequently, *factor-type and regression-type models should be fundamental multivariate statistical models for quantitative mineral resource estimation, and other relevant statistical methods may be considered as variations and combinations of the two types of method*. It follows that the mineral resource descriptors (*r*) can be statistically estimated from the geological features by the following function:

$$r_l = \Theta_l(z_1, z_2, \dots, z_m) + \theta_l, \quad l = 1, 2, \dots, d, \tag{21.4}$$

where *θ<sub>l</sub>* is the random error. The geological processes are created directly by the initial forces of earth movement, while the accumulation of mineral resources results directly from complex interactions of the geological processes. Since the geological processes cannot be measured directly, they must be reconstructed from observable geological features, which can in turn be used indirectly to estimate mineral resource descriptors through relation (21.4).
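The two-step logic of relations (21.3a) and (21.3b) can be sketched numerically. The example below is only an illustration under stated assumptions: the data are synthetic, principal components (via singular value decomposition) stand in for a factor-type analysis, ordinary least squares stands in for the regression-type step, and all sizes and coefficients are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 200, 2, 8  # sampling units, latent processes, observed features (assumed)

# Synthetic "truth", used only to generate data for the illustration.
f = rng.normal(size=(n, p))                       # unobserved processes f_j
loadings = rng.normal(size=(p, m))
z = f @ loadings + 0.1 * rng.normal(size=(n, m))  # observed features, cf. (21.2b)
r = f @ np.array([1.5, -0.7]) + 0.1 * rng.normal(size=n)  # descriptor, cf. (21.2a)

# Step 1 (factor-type): reconstruct p factors from the centered features.
zc = z - z.mean(axis=0)
_, _, vt = np.linalg.svd(zc, full_matrices=False)
f_hat = zc @ vt[:p].T                             # factor scores, cf. (21.3a)

# Step 2 (regression-type): regress the descriptor on the reconstructed factors.
X = np.column_stack([np.ones(n), f_hat])
beta, *_ = np.linalg.lstsq(X, r, rcond=None)      # cf. (21.3b)
r_hat = X @ beta

r2 = 1.0 - np.sum((r - r_hat) ** 2) / np.sum((r - r.mean()) ** 2)
print(f"R^2 of the two-step estimate: {r2:.3f}")
```

The reconstructed factors are recovered only up to rotation and scaling, which does not matter for the regression step. Composing the two steps yields a single mapping from features to descriptor, which is the role played by relation (21.4).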

### **21.4 Scarceness, Rareness, and Exceptionalness**

The activities of mineral exploration have been motivated chiefly by economic and social pursuits (Pan et al. 1992). Constantly growing economic and social demands require greater amounts of raw material, including nonrenewable mineral commodities. The conduct of mineral resource exploration is predicated upon the economic return expected from the discovery of new deposits. An increase in the price of a mineral product, which is equivalent to the sum of the marginal rent and marginal extraction cost, indicates that the mineral resource has become scarce. A basic perspective of both geologists and economists is that mineral resources are scarce materials in the crust as they occupy only an insignificant portion of crustal material.

Any major ore deposit may be regarded in principle as an anomalous or rare phenomenon commonly characterized by one or more geological, geochemical, and geophysical features. Consequently, signatures of significant endogenic mineralization are anomalous and exceptional geologic settings (Gorelov 1982). In particular, the formation of a giant deposit is an extremely rare event created by an exceptional combination of earth processes. Rareness of the giant deposits is reflected in both spatial and temporal dimensions. Significant concentrations of a metal usually have a strong affinity or correlation with particular geologic formations and epochs, as well as metallogenic environments. The genesis of giant deposits may be controlled by particular regularities that differ from those controlling the formation of medium- and small-size deposits of the same composition. It is also thought that the formation of huge deposits is controlled by a so-called 'ore-controlling structure' (Tomson and Polyakova 1984).

Giant deposits often dominate reserves and production. It is not uncommon for a few supergiant and giant deposits to constitute over 50% of the total metal recoverable under current economic and technological conditions; accordingly, the metal quantity in small size deposits is almost negligible (Laznicka 1983). Conversely, giant deposits typically constitute an insignificant part of the total number of ore deposits.

Thus, the scarcity of a mineral resource is essentially determined by the fact that few giant deposits exist in the crust, but the few that do exist strongly dominate reserves and production. Accordingly, the economic viability of mineral exploration is strongly predicated upon its capability of locating the giant or large mineral deposits through delineating the associated geologically anomalous regions of the crust. Unfortunately, conventional quantitative techniques employed have failed to deal with these important particulars satisfactorily, mainly owing to inability to capture the nature of these exceptional constraints, since these unique deposits rarely exhibit common statistical properties.

The discovery process for some deposit types is size biased, meaning that large, high-grade deposits tend to be discovered in the early stages of the exploration of a region (Chung et al. 1992; Pan and Harris 1991). This applies, for example, to deposit types for which structural, geochemical, alteration, or geophysical signatures are correlated with deposit size, and to those for which discovery is primarily by drilling and size is strongly related to areal extent. For such deposit types, the prognostication of exploration outcomes or the estimation of additional resources in undiscovered deposits should take into account the implications of this bias for the tonnages and grades of the undiscovered deposits. However, representing the discovery process of other deposit types, such as vein deposits with great vertical extent or those for which size is only weakly related to exploration anomalies, as size-biased sampling may not be appropriate (Stanley 1992). Improvement in locating deposits or in estimating probabilities for their occurrence requires consideration of the exploration effect, together with the conjunction of improved genetic, tectonic, and other unifying geoscience theories with improved synthesis methods for the effective extraction of information from diverse geodata and improved quantitative models for inference or estimation.

Considering the low concentration of many elements in common crustal rock, e.g., 65 ppm for copper, the presence of a large accumulation of metal (1 to 10 million tons for copper) at concentrations that are mined today requires enrichment by 100s or 1000s of times crustal concentrations and the accumulation of metal from a large amount of common crustal material into a relatively small volume. Typically, this concentration or accumulation is seen as requiring the successive operation of several enrichment-depletion stages. Since these sub-processes rarely take place at the scale and strength required to form an ore deposit, their joint (sequential) occurrence could be an extremely rare event in both space and time. If each of these processes is assumed to be stochastic, the mineralization process is also stochastic, and thus the formation of ore deposits is deemed to be a rare, random event. To the extent that this assumption is acceptable, the concept of rareness of ore deposits is equivalent to the smallness of the probability for the formation of an economic deposit.
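The orders of magnitude in this argument can be checked with a few lines of arithmetic. In the sketch below, only the 65 ppm crustal abundance and the 1 million ton accumulation come from the text; the ore grade, the per-stage probabilities, the assumption that the stages are independent, and the assumption of complete metal extraction from the source rock are hypothetical simplifications chosen for illustration.

```python
CRUSTAL_CU_PPM = 65.0  # crustal copper abundance quoted in the text


def enrichment_factor(ore_grade_percent, crustal_ppm):
    """Ratio of ore grade to crustal abundance (1 % equals 10,000 ppm)."""
    return ore_grade_percent * 10_000 / crustal_ppm


def source_rock_tonnes(metal_tonnes, crustal_ppm):
    """Crustal rock needed to supply the metal, assuming complete extraction."""
    return metal_tonnes / (crustal_ppm * 1e-6)


def joint_probability(stage_probs):
    """Probability that every stage occurs, assuming independent stages."""
    prob = 1.0
    for p_stage in stage_probs:
        prob *= p_stage
    return prob


# Hypothetical inputs for illustration only.
grade = 0.65                 # % Cu, assumed ore grade
stages = [0.1, 0.05, 0.02]   # assumed probabilities of three enrichment stages

print(f"enrichment factor: {enrichment_factor(grade, CRUSTAL_CU_PPM):.0f}x")
print(f"source rock for 1 Mt Cu: {source_rock_tonnes(1e6, CRUSTAL_CU_PPM):.2e} t")
print(f"joint probability of all stages: {joint_probability(stages):.1e}")
```

Even moderately unlikely stages multiply to a very small joint probability, which is the sense in which rareness is equated with a small probability of deposit formation. Dependence between stages would change the numbers but not the qualitative point.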

The concept of rareness can be compared to that of exceptionalness described by Gorelov (1982) and the conditional exceptionalness proposed by Pan (1989). Some other terms found in literature carrying similar meanings include atypicality, uniqueness, anomaly, etc. The concept of exceptionalness is important and useful in quantitative mineral exploration. The most general feature of major commercial ore deposits is that the geological structures of their ore fields are exceptional and anomalous compared with those of neighboring areas.

Note that scarceness is a term relevant to the economic aspects of resources, rareness is more closely associated with the statistical (probabilistic) characteristics of mineral occurrences, and exceptionalness should be used in a geological context. More specifically, one would say that ore deposits are probabilistically rare and geologically exceptional, even though the metal derived from them may not be scarce in the economic sense described by Barnett and Morse (1963). These terms are often used to describe the status of mineralization events in a relative sense, but they can be statistically quantified in a rigorous framework.

### **21.5 Intrinsic Geological Unit**

Most traditional resource estimations have used regular inter-grid areas, or cells, as the sampling scheme and estimation unit. The "cell" approach has a number of drawbacks. The most significant is that geological processes can be reconstructed through observable geoscience features, which are measurable in geological units, not artificial cells; cell-based measurements therefore tend to distort the intrinsic relations between geological features and mineral resource descriptors. Secondly, when geological features that are spatially correlated, or even connected, are quantified cell by cell, it is difficult to capture the essential genetic factors that played key roles in metal enrichment. Finally, the cell approach easily ignores the exceptional conditions required for the formation of large deposits, which cannot be readily quantified through grids.

### *21.5.1 IGU Definition*

In contrast with a population of cells having multiple attributes, consider a population in which each member consists of a set of genetically related objects, e.g., igneous intrusives and associated altered host rock, and each member is described by fields of the related geologic objects. Here, mineral resource descriptors and geoscience measures are attributes of a group of geoscience fields, which in turn are attributes of a set of genetically related geologic bodies. Such a scheme employs a sampling reference for quantification and integration of geoscience information that is *intrinsic* to the deposit type being sought. For this reason, Intrinsic Geological Units (IGUs) were proposed by Pan (1989) and Harris and Pan (1990).

The concept of intrinsic geological units, formally documented in Pan and Harris (1993), has evolved from the notion of *intrinsic samples* (IS), or *consistent geological area*. The basic ideas behind both notions are identical; a minor difference lies in the procedure for delineation. This concept has some characteristics in common with the notion of "geological anomalies" proposed by Zhao (2007) (see also Zhao and Chi 1991), although the procedure of unit delineation differs significantly.

An appropriately delineated IGU is at once a great improvement over the traditional inter-grid area or cell because it represents the joint occurrence of geologic bodies that are genetically related to the mineral resources of interest. Thus, even before geological attributes of the IGU are quantified, the very presence of an IGU implies highly significant geoscience information about geology and mineral resources. In contrast, the cell is simply a geometric reference. Therefore, it is inevitably true that geological attributes of an IGU carry far more geoscience information than do the geological attributes of a cell.

IGUs may be formally defined as *members of a population consisting of sets of genetically related geologic objects that are usually defined by their geofields* (Pan 1989). Each member (IGU) of the population of IGUs constitutes an independent set of geologic objects that are genetically related to each other and to mineral deposits, although generally only some of these members contain ore deposits and mineral resources. Moreover, although a particular member of a population of IGUs contains mineral deposits, it may not be uniformly mineralized everywhere within its volume. In other words, a mineral resource unit generally is a subset of an intrinsic geologic unit.

### *21.5.2 Critical Genetic Factor*

Any mineral deposit or mineralization can be considered as an anomalous concentration of one or more elements or their chemical compounds when compared to crustal materials. This anomalous region originated from anomalous genetic processes or their superposition during certain geological epochs. Usually, a genetic model consists of a hierarchy of earth processes—from preconditions to post mineralization preservation—which acted during one or more previous time spans, and as such, these processes are not observable. Instead, the geologist must infer their previous existence and operation using observable indirect evidence, e.g., geologic features, geochemical suites, hydrothermal alteration, aeromagnetic and gravity anomalies, etc.

Since particular genetic processes were initiated and developed under certain specialized circumstances, the existence of mineralization, as a significant outcome of the processes, must also be conditional upon these circumstances. In other words, whether an anomalous concentration of a metal exists in a region depends solely upon the existence of certain necessary conditions during crustal evolution. Although there might exist a number of such necessary conditions for a particular genetic process or mineralization, one, or at most a few, of them is referred to as critical. For convenience, such a critical necessary condition is called a *Critical Genetic Factor* (CGF). The idea of the CGF does not rest solely upon one factor being more important or critical than another in the formation of a mineral deposit, because unless all genetic factors are present, there is no mineral deposit or mineral endowment. Criticality, as used here, rests more upon the idea that the CGF arises from few, preferably only one, earth process and that the features formed by that process can be detected reasonably well by conventional sensing technologies, e.g., magnetics, gravity, geochemistry, and geological mapping. If this CGF is not present, the intrinsic geological unit is considered to be absent. For example, for a mineral deposit related to magmatic fluids, the heat source that drives intrusion may be treated as the CGF for identification of the IGUs associated with deposits of this type. Practically, only a single CGF is necessary for identifying spatial units that are intrinsic to mineral deposits of a single genetic type, but more than one CGF may be necessary when there is more than one genetic type of interest.

An IGU can be further understood to be a member of a population consisting of sets of geologic objects genetically associated with the CGF, each set being a member of the IGU population. Individuals from the population are called *known* IGUs if the related CGF is directly observed, while others are unknown or predicted when the CGF cannot be observed directly, but is inferred to exist because of the presence of geologic fields related to the CGF and to recognition criteria.

### *21.5.3 Critical Recognition Criteria*

The CGF often may be identified as a process, based upon geoscience; conceptually, it may be an abstraction, instead of an observable feature. In order to make the CGF concept workable in practice, a set of special geologic features which give firm evidence of the previous existence and operation of the CGF are established. Such a feature is here termed a *Critical Recognition Criterion* (CRC). Each of these CRCs constitutes a sufficient condition for existence of the CGF. Any spatial location at which one or more CRCs occur is by definition a location within an intrinsic unit.

Although the concept of the CRC makes it possible to identify a CGF, the occurrences of CRCs known at the time of application may not represent the entire picture of the CGF. In other words, estimation of the presence of a CGF based only upon known CRCs could be biased due to imperfect knowledge of the spatial distribution of the CRCs. For example, a CRC might exist underneath sedimentary cover, even though it is not found by surface geological mapping. This fact dictates that the identification of CRCs beyond surface observation is an important step in the appropriate prediction of the distribution of the CGF. This can be done by establishing statistical relations of each CRC to a set of selected geological, geochemical, and geophysical fields, which provide indirect evidence for the presence of the CGF.

Although the existence of a recognition criterion at a spatial location almost surely indicates that the location is within an IGU, the boundary of the IGU still is unknown. Consider, for example, the outcrop of a Tertiary intrusive assumed to be a CRC. Then, the outcrop area is surely within an IGU, but probably, some of the area around the outcrop also is within the same IGU because of the likelihood that at depth the intrusive extends laterally underneath the surface rocks. Consequently, the boundary of an IGU is usually uncertain. One way of representing such uncertainty is to assign each spatial location a probability for presence of one or more recognition criteria based upon a collection of geological observations at that location.

### *21.5.4 IGU Delineation*

At a known location (with at least one observed CRC), the probability for the CGF should be one or very close to one. This implies that the point is almost surely within an IGU. At an unknown location (with no observed CRCs), all of the CRC probabilities estimated from geoscience fields will provide a measure of the likelihood of the presence of the CGF.

Several methods have been proposed and employed for delineating IGUs. One example is the three-step method developed by Pan and Harris (1993), which delineates IGUs by estimating and combining probabilities of CRCs. Another example, given by Pan (1989) and Harris and Pan (1991), is based on the union of marginal field anomalies. As discussed, the presence of a CRC gives evidence for the existence of an IGU; the boundary of the IGU is then delineated by resolution of the geoscience fields associated with the CRCs. In this approach, the key step is to establish a procedure to identify the anomalies in terms of CRCs for each geoscience field. These anomalies (called marginal anomalies) are then combined into one anomaly through spatial union. This is similar to the concept of using the maximum CRC probability to represent the probability for the CGF.
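The correspondence between the union of marginal anomalies and the maximum CRC probability can be illustrated on a small grid. The sketch below is not the three-step procedure of Pan and Harris (1993); the two probability maps, their names, and the threshold of 0.5 are hypothetical values chosen only to show the mechanics.

```python
import numpy as np

# Hypothetical CRC probability maps on a 4 x 4 grid (values are assumed).
p_intrusive = np.array([[0.9, 0.7, 0.2, 0.1],
                        [0.8, 0.6, 0.1, 0.0],
                        [0.3, 0.2, 0.1, 0.0],
                        [0.1, 0.0, 0.0, 0.0]])
p_alteration = np.array([[0.2, 0.4, 0.1, 0.0],
                         [0.5, 0.8, 0.6, 0.1],
                         [0.1, 0.7, 0.3, 0.0],
                         [0.0, 0.1, 0.0, 0.0]])
crc_maps = np.stack([p_intrusive, p_alteration])

# Route 1: the maximum CRC probability represents the probability of the CGF.
p_cgf = crc_maps.max(axis=0)

# Route 2: threshold each field separately (marginal anomalies), then take
# their spatial union.
threshold = 0.5  # assumed anomaly threshold, shared by both fields
marginal_anomalies = crc_maps >= threshold
igu_mask = marginal_anomalies.any(axis=0)

# With a common threshold the two routes delineate the same cells.
assert np.array_equal(igu_mask, p_cgf >= threshold)
print(igu_mask.astype(int))
```

The two routes coincide only because the same threshold is applied to every field. If each field had its own anomaly threshold, the union would in general differ from a single cut on the maximum probability.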

Genetic theories are most useful for grass-roots exploration or reconnaissance programs, where deposit information is not abundant. Without the guidance of genetic models, it is unsafe to select an area for a massive investment. Hence, the concept of the IGU is most useful for regional mineral exploration, because it provides a quantitative framework for delineating those areas that have the conditions necessary for the presence of a deposit. In large-scale exploration, such as at deposit or district scale, the IGU methodology is still useful if detailed aspects of deposit genetic models can be specified. With abundant occurrence information, it is possible to extract genetic factors as necessary conditions for the localization of deposits. However, in most cases, this detailed information is not available or not in a usable form. In general, a mining district is already a known IGU defined by broad genetic models. Unless refined genetic models are available, the IGU will not provide additional power to identify areas of potential at deposit or district scale.

### *21.5.5 Relations Between IGU and Mineral Target*

As discussed, CGF serves as the necessary condition for presence of an IGU, but it is not a sufficient condition for the boundary definition of the IGU. The purpose of IGU proposal is to improve methodology of target identification and delineation, which, in turn, improves the effect of mineral resource assessment. The IGU theory creates a new platform on which new approach to mineral target identification can be constructed. A critical question to ask would be what is the relation between IGU and mineral targets?

Theoretically, an IGU is a necessary condition for the presence of the mineralization of interest, and hence for the presence of a mineral target. The concept of IGU provides a precursor to the identification of mineralization or deposits. However, the presence of an IGU is not a sufficient condition for the presence of mineralization or a deposit. In general, an IGU is much broader in areal or volumetric extent than a mineral target. Mineral targets are defined in the IGU areas where additional necessary and even sufficient conditions are observable or inferable from maps or data collected from various sensing or engineering technologies. Instead of using an inter-grid sampling scheme, the framework of IGU provides a more practical and useful approach for extracting sufficient conditions for the identification of mineralization events through reconstruction of the geological processes that resulted in the occurrence of mineralization.

For mineral resources appraisal, the concept of IGU establishes a theoretical base for definitions of necessary and sufficient conditions of mineralization or deposit. It has radically changed the conventional methodology for estimation of mineral potentials. The relationships of IGU, target, occurrence, and deposit are depicted as follows:

*Deposit* ⊆ *Mineral Target* ⊆ *IGU* ⊆ *Working Area*

Clearly, an IGU is not a mineral target, but a mineral target must be enclosed in an existing IGU. Similarly, a mineral target is not a deposit, but a deposit must be localized inside an existing mineral target. Therefore, identification and delineation of IGUs is a necessary step for definition of mineral targets. This new approach will play a revolutionary role in improvement of mineral resources assessment.
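One way to read the chain of inclusions quantitatively, offered here as an illustrative sketch rather than a result from the cited studies, is to treat each set as the event that a randomly chosen location in the working area falls inside that unit. The symbols $D$, $T$, and $U$ (deposit, mineral target, and IGU) are notation assumed for this illustration only:

$$
D \subseteq T \subseteq U \;\Longrightarrow\; P(D) \le P(T) \le P(U), \qquad P(D) = P(D \mid T)\, P(T \mid U)\, P(U).
$$

Under this reading, delineating the IGU bounds the probability of a deposit from above, and each later stage of targeting refines the estimate by a conditional factor.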

### **21.6 Economic Truncation and Translation**

A mineral deposit is not a purely geological concept when it is linked to resources and reserves. The effects of economic truncation and translation on mineral deposits were recognized several decades ago, and a thorough discussion of them has been given by Harris (1984). These phenomena reflect the important fact that mineral resources generally are a dynamic function of relevant economic and technologic constraints, including the price of the product and the costs associated with various production phases, such as mining, milling, smelting, and refining. Available data on mineral deposits generally are truncated by a cost surface which is defined in terms of physical features of the deposits and technological states. In other words, the collection of reported mineral deposits reflects only the truncated fraction of the entire population of mineral deposits. Thus, direct use of these data unavoidably results in biased estimates of mineral resources, as the characteristics of the resource distribution derived from the partial data set are only a distorted representation of deposits as they occur in nature.
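The bias introduced by truncation can be seen in a small numerical sketch. The deposit population and the truncation rule below (a minimum amount of contained metal) are hypothetical; an actual cost surface would depend on more physical and technological variables than tonnage and grade alone.

```python
# Hypothetical deposit population: (tonnage in Mt, grade in % metal).
# The values are illustrative assumptions, not observed data.
population = [
    (0.5, 0.4), (1.0, 0.6), (2.0, 0.5), (5.0, 0.8),
    (10.0, 1.0), (20.0, 0.7), (50.0, 1.2), (100.0, 0.9),
]


def is_reported(tonnage, grade, min_metal=0.05):
    """Assumed truncation rule: a deposit enters the reported data only if
    its contained metal (Mt) reaches a minimum set by cost and technology."""
    return tonnage * grade / 100.0 >= min_metal


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


reported = [(t, g) for t, g in population if is_reported(t, g)]

mean_tonnage_all = mean([t for t, _ in population])
mean_tonnage_reported = mean([t for t, _ in reported])
print(len(reported), mean_tonnage_all, mean_tonnage_reported)
```

Only the larger half of this population survives the truncation, so statistics computed from the reported deposits overstate the mean tonnage of the deposits as they occur in nature.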

Translation refers to the fact that commonly reported deposit grades and tonnages are for ore reserves and that these tonnages and grades generally differ from those for the total mineralized material for the deposit as a geologic phenomenon. For deposit types having great lateral or vertical gradation in mineralization, economic rents may lead to the selection of a cutoff grade that leaves part of the deposit in the ground. When this is the case, reported ore tonnage is smaller than deposit tonnage and average grade is higher than deposit average grade.
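The translation effect can be sketched with a deposit divided into blocks of graded mineralization. The block grades, the equal block tonnage, and the cutoff grade of 0.8% are hypothetical values chosen only to show the direction of the effect.

```python
# Hypothetical deposit split into equal-tonnage blocks with graded mineralization.
block_tonnage = 1.0  # Mt per block, assumed equal for simplicity
block_grades = [0.2, 0.4, 0.6, 0.9, 1.3, 1.8]  # % metal, illustrative values


def tonnage_and_grade(grades, tonnage_per_block, cutoff=0.0):
    """Return (tonnage, average grade) of the blocks at or above the cutoff."""
    selected = [g for g in grades if g >= cutoff]
    tonnage = tonnage_per_block * len(selected)
    average = sum(selected) / len(selected) if selected else 0.0
    return tonnage, average


# The whole mineralized body, i.e. the deposit as a geologic phenomenon.
deposit_t, deposit_g = tonnage_and_grade(block_grades, block_tonnage)
# The reported ore reserve after applying an assumed economic cutoff grade.
ore_t, ore_g = tonnage_and_grade(block_grades, block_tonnage, cutoff=0.8)
print(deposit_t, deposit_g, ore_t, ore_g)
```

The reported ore tonnage is smaller than the deposit tonnage and the reported average grade is higher than the deposit average grade, as the paragraph above describes. With a uniform grade distribution, every block would pass or fail the cutoff together and the effect would vanish.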

The importance of translation as a distortion varies with the mineral commodity and the maturity of the exploration activity. In general, the greater the variation of grade within a deposit (intra-deposit grade variance), the stronger the translation effect, and vice versa. For those deposit types having sharp boundaries or a uniform grade distribution, the translation effect may be negligible. For some deposit types, it is also true that the longer the deposit has been mined, the greater the reserve additions and the more representative the revised ore tonnage and grade data are of the geologic deposit.

The truncation and translation effects are related to some degree when production costs are strongly influenced by ore tonnage and ore average grade, provided that intra-deposit grade variation and the spatial distribution of grades permit the effective use of cutoff-average grade relations to maximize the net present value of economic rents. However, translation occurs mainly in mine development and subsequent mining, while truncation reflects both exploration and mining. Conversion of resources to reserves involves using cutoff grades that define the boundaries of the economic ore portions of the deposits. This procedure involves both translation and truncation.

In order to resolve these difficulties, Harris (1984) suggested a possible remedy: treating the truncation effect requires first identifying the truncation relationship, and second the explicit consideration of this relationship in the estimation of parameters, one of which is the correlation of deposit tonnage with grade. Although several attempts have been made to mitigate the difficulty in practical studies by employing more sophisticated mathematical methods in mineral endowment estimation, the problem remains to be explored further, as estimation of the cost relation is still based on the truncated data. Thus, the cost relation must be reconstructed from a truncated surface before estimation is carried out.

The importance of truncation and translation effects on a quantitative estimate of mineral resources depends to some degree upon the means of estimation and upon the objective of the estimation. For example, when estimation is to be done using analogue or control regions and the objective is to estimate the magnitude of resources for price, cost, and technology similar to those of the analogue regions, the effect of truncation and translation on the estimate may be minor. But, when the objective is to estimate the magnitude of resources for improved exploration and production technology, the effect of truncation and translation upon the estimate may be very significant.

### **21.7 Information Synthesis**

The geologist's view of an ore deposit may differ from that of the economist. Economists tend to consider an ore deposit as being a continuous geologic phenomenon that is discretized by applying a set of economic regularities, while geologists tend to perceive a deposit to be a discrete geologic phenomenon with an anomalous concentration of one or more valuable elements (Agterberg 1981). Physical mechanisms of ore genesis suggest that the continuity of ore concentration is meaningful mainly in a relative sense. A high magnitude of element concentration in host rocks often contrasts sharply with concentrations in surrounding wall rocks. This perspective may be partially illustrated by the De Wijs scheme of element enrichment in a deposit, which was extended by Brinck (1972) to describe element concentrations within the crust. Another well-known hypothesis is Skinner's bimodal proposition of element distribution, which asserts that a gap exists between the grades of mineralized rock and the grades of common crustal material (Skinner 1976).

### *21.7.1 Spatial Continuity*

Although the continuity of the statistical distribution of grades seems to differ conceptually from that of spatial and temporal distributions, they are in fact closely related. For example, if the proposition is accepted that the grades of an element are continuously distributed in space and time, the continuity of the statistical distribution of these grades can be automatically invoked in certain environments, and vice versa. This assertion may be explained by the requirement that samples must be taken in a uniform and regular manner from the population of interest.

Metallogenic and tectonic studies depict elements to be concentrated in geologic terrains of different scales, such as ore shoot, ore body, ore district, ore belt, ore province, etc. (Laznicka 1983). This hierarchical structure of ore formation seems to indicate that continuity exists within each of these scales, while discreteness of ore concentrations can be seen between these different scales. For instance, an ore district may be viewed as a continuously anomalous region within an ore belt, but the individual deposits included in that same district are discrete geological phenomena. This perspective carries strong implications as to sampling procedures and the organization of data for the estimation of mineral potentials.

Thus, a specific mineral exploration project focused upon the ore deposits of certain valuable elements formed and confined in a particular dimensional scale requires an appropriate sampling scheme of that same scale. For example, a new ore body developed within a deposit may be considered as mineral potential at the deposit scale, while a new ore deposit discovered in a district is regarded as mineral potential at a district scale. When estimation is aimed at predicting the mineral potentials at the district scale, the sampling scheme must accommodate the geological and mineral continuity at the corresponding hierarchical level. The match in scale is a prerequisite in mineral resource estimations.

### *21.7.2 Information Enhancement*

Although in one sense considerable progress is apparent in the use of quantitative techniques for mineral exploration and resource estimation since the early work in the 1950s and 1960s (Allais 1957; Harris 1965), much less success has been made in creating estimates that are or have been used in mineral exploration and mineral policy decisions. Even though quantitative estimation of local/drilling targets may require the detailed quantitative characterization of favorable geological, geochemical, and geophysical information, many explorationists still favor subjective and qualitative methods for the integration of geodata. Concurrent with these applications, mathematical methods were designed and demonstrated, but few were adopted. Perhaps, this is a natural evolution of the science of quantitative mineral exploration in terms of data integration, because geologists in general have been slow to adopt quantitative techniques. However, this reluctance is at least partly related to ineffective integration of geodata and insufficient extraction of geoscience information by quantitative models. Mineral resources cannot be satisfactorily estimated until more geoscience information is related by improved methods to mineral occurrence. Major difficulties that have hindered further development have been far from fully attacked, and some of them are even completely ignored.

A common practice in quantitative mineral exploration is to collect all relevant geoscience data available in the study region, including numerical observations, digitized maps, and remotely sensed images. These data are then compiled, digitized, resorted, and formatted in a readily manageable database. Each record is usually stored as a row, while each geologic attribute occupies a column. In standard statistical terms, each record in a database is called a sample and each attribute is referred to as a variable. A sample in mineral exploration can be a spatial point or a one-, two-, or three-dimensional block. Most data in regional mineral exploration are interpreted in two-dimensional areas.

Sampling schemes are considered to be an important factor in data interpretation and target identification. A viable sampling scheme should be able to cope with the hierarchical structures of mineralization or ore concentration. Mineralized geological bodies in different hierarchical scales correspond to different domains in space and time, which are generally defined by particular tectonic settings and geological formations. Statistically, samples should be randomly taken in the population of mineralized and non-mineralized geological blocks of the same scale. Furthermore, spatial characterization of geological features is another criterion for reasonable representation of the resource variability. A reliable sampling scheme should also result in a sample distribution which portrays closely the 'true' population distribution of geological and mineralized bodies. Our experience has shown that quantities measured on the basis of equal area cells might lead to distorted probability distributions.

The original data may include geological, geochemical, geophysical, as well as remote sensing information in diverse modes. For example, geological data can be hydrothermal alteration, faults, and lithology, which are typically considered as non-numerical attributes. Geochemical data can be collected from a rock outcrop, stream sample survey, or a soil grid survey. Magnetics data can be obtained from an airborne geophysical survey. It is readily seen that all these types of geodata are diverse not only in terms of sampling methods, but also the presentation of quantities. Different sampling schemes create different data densities, inconsistent spatial locations, disconnectivity, as well as uneven precisions. Different quantity presentations may give rise to even more serious problems in data integration. The most difficult problem is dealing with the correlation of different variables, which is the most critical step in geological information synthesis, especially when some data are non-numerical. The first step in overcoming these difficulties is the quantification and unification of different data sets.

The quantification of non-numerical attributes refers to assignment of a numerical value to each sample location; of course, the numerical value must convey explicit geological information. For example, a binary assignment gives 1 or 0 to the attributes to represent presence or absence. When each data set is 'quantitative', the next step is to enhance geological information of each individual data set before they are compared, correlated, and integrated. As a matter of fact, enhancement of information from original and individual data is the most critical step towards a successful information synthesis for mineral target selection. Unfortunately, geologists traditionally tend to place too much emphasis on the original data and denigrate the importance and necessity of data filtering, cleaning, and enhancing. Conversely, some geomathematicians devote too much attention to processing of data and give too little regard to fundamental characteristics of the original data and the useful information of the data. Original data carry the most genuine information, but they may be 'contaminated' or masked by noise and even distorted due to inadequate sampling or analytical methods.
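The binary assignment mentioned above can be sketched as follows. The sample records, attribute names, and the choice of granite as the favorable lithology are hypothetical; in practice the coding must be chosen so that the numerical value conveys explicit geological information for the deposit type of interest.

```python
# Hypothetical sample records; attribute names and values are illustrative.
samples = [
    {"id": "S1", "lithology": "granite", "alteration": "sericitic", "fault": True},
    {"id": "S2", "lithology": "limestone", "alteration": None, "fault": False},
    {"id": "S3", "lithology": "granite", "alteration": None, "fault": True},
]


def indicator(records, attribute, favorable=None):
    """Binary coding of a non-numerical attribute.

    With `favorable` unset, return 1 where the attribute is present and 0
    where it is absent. With `favorable` set, return 1 only where the
    attribute equals that value.
    """
    codes = []
    for record in records:
        value = record.get(attribute)
        if favorable is None:
            codes.append(1 if value else 0)
        else:
            codes.append(1 if value == favorable else 0)
    return codes


alteration_present = indicator(samples, "alteration")
is_granite = indicator(samples, "lithology", favorable="granite")
fault_present = indicator(samples, "fault")
print(alteration_present, is_granite, fault_present)
```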

Filtering and enhancing of useful information is important to remove noise and reveal signals, such as separation of soil geochemical anomalies from background values. Furthermore, one data set may carry information on several geological aspects. Some of these signals are not the major interests and their presence sometimes masks or distracts from the information useful in identifying mineral targets. These signal components are unwanted, even though they are not noise, and should be filtered out, or at least suppressed. However, many filtering, enhancing, and other data processing techniques can easily introduce artifacts or false signatures. For instance, a magnetic anomaly map generated from a short-wavelength filter can exhibit many high-amplitude, single-grid-point anomalies, which are known as the aliasing effect in the geophysical literature. Another example is interpolation which has been commonly used in data interpretation and quantitative mapping. All interpolation algorithms, e.g., minimum curvature and kriging, which can be considered as low pass filters, are notorious in that they tend to produce overly smoothed surfaces and quite often cause a loss of important detailed features. It is our opinion that some applications of quantitative analysis in mineral exploration have either failed to extract the important geoscience information or have created too many artifacts relative to signals; these effects are believed to be among the major reasons underlying the reluctance of geologists to replace qualitative judgment by quantitative analysis.
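The loss of detail caused by low-pass filtering can be shown on a one-dimensional profile. A centered moving average is used here as a simple stand-in for a smoothing operator; it is not an implementation of minimum curvature or kriging, and the profile values are hypothetical.

```python
def moving_average(values, window=3):
    """Centered moving average, a simple low-pass filter.

    Near the ends of the profile the average uses only the neighbors
    that are available.
    """
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical profile: flat background of 1.0 with one narrow anomaly of 10.0.
profile = [1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0]
smoothed = moving_average(profile, window=3)
print(smoothed)
```

The single-point anomaly of amplitude 10 is reduced to 4 and spread over three cells, so a threshold that would have flagged the original anomaly may no longer do so after smoothing.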

The above discussion suggests that filtering and enhancing are necessary for geological data interpretation and integration, but care is warranted in the use of enhancing techniques. Also, enhancement of a geological attribute includes identification and description of spatial structural characteristics, which constitute useful information about spatial auto-correlation of the attribute. More specifically, the objective of information enhancement is to maximize the signal relative to noise. By analogy, the best picture of an object taken by a camera requires a correct focus on the object; a focus that is either too short or too long will blur the picture. Moreover, one should keep in mind that no enhancement technique can create information that is not present; it can only reveal important features of the information carried by the attribute. But, without enhancement, some important features may be neither identified nor employed in subsequent analyses. Since the amount of information in each attribute is limited, enhancement also is limited. A minimum level is necessary, for an insufficient removal of noise fails to reveal the signals to be extracted and used in subsequent analyses. Generally, the tendency of analysts is to ignore or inadequately remove noise and to over-enhance the signals. Of course, intense enhancement of data that contain noise leads to enhancement of noise as well as the signal and to false patterns and inter-relations with other information.

### *21.7.3 Data Integration*

Synthesis of geoscience information includes the quantification of geological observations, maps, and other geological images; extraction of quantitative variables; statistical preprocessing; filtering and enhancement; estimation of statistical relations among variables; and the combination of different data sets (layers). Clearly, most of the components require some amount of computation which can be performed more efficiently by using a computer. There is an obvious advantage of using a computer when many variations of the same type of analysis are required (Green 1991) or when important information includes the computer interaction of several large sets of geodata. This additional information helps to reduce uncertainties and ambiguities in geological interpretation and mineral potential estimation. Furthermore, some effective and sophisticated statistical techniques which generally prohibit manual calculations can be readily implemented on a computer.

Mineral exploration generally deals with diverse geological data in various chemical and physical forms. Appropriate information synthesis should reflect the types of information contained in each data set and their geological implications. For example, geochemical information is generally different from geophysical information. Even the same type of data, e.g., geochemical, may require different interpretation when it is obtained through different sampling techniques. For instance, soil geochemical samples are processed in different ways from stream samples. Geophysical data are rich in depth information and are capable of locating blind targets, but the extraction of such information requires appropriate processing and analysis. It is important to note that any data set has its limitations in the diagnosis of geologic favorability for mineralization, and interpretation and information synthesis must recognize these limits. Because of vast differences in geoscience content, precisions of measurement, and scales of reference among diverse geologic data, direct integration of these data cannot constitute their optimum use in mineral exploration unless the data are appropriately preprocessed and unified. Unfortunately, these problems are far from adequately treated in traditional exploration applications.

Geoscience attributes are usually processed, correlated, and integrated to produce some estimates which characterize the favorability or probability of mineral occurrence. A more comprehensive approach treats each of the various kinds of geoscience information as a field of a particular type, e.g., geochemical fields, magnetic fields, etc. (Harris and Pan 1990, 1991). Mineralization may also be viewed as an ore field. The notion of field enriches useful information about three dimensional characteristics of geological bodies. Such a field is generally more expressive of meaningful geoscience information relevant to mineral resources than are 'man-made' variables, e.g., measurements quantified with regard to an artificial reference, such as a grid.

A major objective of information synthesis is to maximize the extraction of relevant geoscience information in terms of mineral potentials. Geological measurements in mineral exploration are commonly multivariate in terms of either several variables (fields) measured at the same sample locations, or different variables measured at different sample locations but in the same study region. In the latter case, synthesis may require an appropriate interpolation of the data before they can be jointly analyzed. When strong correlations exist among the variables, multivariate techniques are necessary to capture the joint information from multiple associations as well as the marginal contributions from individual attributes. A multivariate exploration system sometimes can be decomposed into several less significantly correlated subsystems with smaller dimensions. This partitioning may reduce the complexity of modeling and possibly permit more robust estimates at the expense of decreasing the degrees of freedom in the system.

Optimum combination of different geological data sets (layers) has been a central task in data integration and information synthesis. Agterberg (1989) gives a comprehensive review of some major integration methods developed in recent years. Two major types of models notable in the literature are favorability analyses and probability methods. Pan and Harris (1992) propose a weighted canonical correlation method for the estimation of a favorability function. These methods are most suitable for combining continuous geological attributes. Agterberg (1992) provides probabilistic techniques for combining indicator patterns in weights of evidence modeling. Both types of models, however, are deficient in some regards. Favorability methods often carry ambiguities in predicting mineral potentials, whereas evidence combination techniques are subject to strong constraints on the independence of different attributes. Moreover, as an information synthesis method, weights of evidence is simplistic. Another useful combination approach is color (RGB) image composition (Sabins 1987). This type of technique also bears some serious limitations, since most current image processing software systems are only capable of combining a very limited number of 'layers'. Therefore, there is a need for development of more effective combination methods.
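For readers unfamiliar with weights of evidence, the following is a minimal sketch of the standard calculation for a single binary pattern: the positive and negative weights are log ratios of conditional probabilities, and the posterior log odds equal the prior log odds plus the applicable weights. The cell counts are hypothetical and the function names are chosen for this illustration. Summing the weights of several patterns is valid only under the conditional independence assumption noted above.

```python
import math


def weights_of_evidence(n_total, n_deposit, n_pattern, n_both):
    """Return (W+, W-) for one binary pattern B and deposit indicator D.

    W+ = ln[P(B|D) / P(B|not D)] and W- = ln[P(not B|D) / P(not B|not D)],
    with probabilities estimated from unit-cell counts.
    """
    p_b_given_d = n_both / n_deposit
    p_b_given_not_d = (n_pattern - n_both) / (n_total - n_deposit)
    w_plus = math.log(p_b_given_d / p_b_given_not_d)
    w_minus = math.log((1 - p_b_given_d) / (1 - p_b_given_not_d))
    return w_plus, w_minus


def posterior_probability(prior, weights):
    """Add weights to the prior log odds and convert back to a probability."""
    logit = math.log(prior / (1 - prior)) + sum(weights)
    return 1 / (1 + math.exp(-logit))


# Hypothetical counts: 1000 cells, 20 with deposits, 200 with the pattern
# present, and 12 with both a deposit and the pattern.
w_plus, w_minus = weights_of_evidence(1000, 20, 200, 12)
prior = 20 / 1000
p_if_present = posterior_probability(prior, [w_plus])
p_if_absent = posterior_probability(prior, [w_minus])
print(w_plus, w_minus, p_if_present, p_if_absent)
```

For a single pattern the posterior reproduces the observed frequencies exactly (12/200 where the pattern is present and 8/800 where it is absent). Dependence between patterns affects the result only when several are combined.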

Geologic information about mineral occurrence may be roughly grouped into two categories: marginal information contributed from individual variables or fields and joint information contributed from the cross correlations between different variables or fields. The first category of information has been extensively quantified and interpreted in most of the traditional studies on mineral exploration. The second category, however, has been inadequately treated due to complexities and ambiguities. Information from the inter-dependencies of variables can be an important factor in improving the definition of exploration targets, if single exploration variables are ambiguous, noisy, and/or uncertain as to mineral occurrence. Thus, an effective synthesis technique must be able to efficiently quantify and extract the cross-correlation information.

Intuitively, there should exist a combination of variables in multivariate mineral exploration that is sufficient to capture the majority of useful information and at the same time to minimize the effort of manipulation. It is probably incorrect to think that more variables are always preferred. On the contrary, a large set of data almost always contains redundant information which, if not appropriately eliminated, can result in unstable solutions and create noisy estimates. Therefore, another important problem in information synthesis is to select and refine variable sets such that redundant and trivial variables are excluded from consideration.
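One simple way to screen out redundant variables, shown here only as an illustrative sketch, is to drop any variable that is almost perfectly correlated with one already retained. The greedy rule, the 0.95 threshold, and the variable names are assumptions made for this example; the text does not prescribe a particular selection method.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def drop_redundant(variables, max_abs_corr=0.95):
    """Greedy screen: keep a variable only if its absolute correlation with
    every variable already kept stays below the threshold."""
    kept = []
    for name, values in variables.items():
        if all(abs(pearson(values, variables[k])) < max_abs_corr for k in kept):
            kept.append(name)
    return kept


# Hypothetical variables measured at the same five sample locations.
variables = {
    "Cu_ppm": [10.0, 20.0, 30.0, 40.0, 50.0],
    "Cu_pct": [0.001, 0.002, 0.003, 0.004, 0.005],  # same information, other unit
    "mag_nT": [5.0, 1.0, 4.0, 2.0, 3.0],
}
kept = drop_redundant(variables)
print(kept)
```

The result of a greedy screen depends on the order in which variables are examined, so in practice the ordering should reflect geological relevance.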

### *21.7.4 Target Delineation*

Mineralization is considered an anomalous geologic event, because the element is present in anomalous grades, in rare minerals, or in anomalous quantities. The purpose of mineral exploration is to locate economic mineral deposits in such anomalous regions based on direct and, most often, indirect information (chemical, physical, structural, etc.) and ore genetic theories. Since the direct information, e.g., the concentration of the metal of interest, is usually meager in the early stages of exploration, indirect information (e.g., geological, geophysical, geochemical, remote sensing, etc.) is commonly employed to identify mineral exploration targets. However, the mineralized anomalies, which are distinct from the surrounding areas in terms of the accumulated metal(s), are typically fuzzy or ambiguous in terms of indirect information. Therefore, ambiguities of information raise an intricate question, i.e., how to 'best' define targets in terms of the maximum inclusion of mineralized rock and exclusion of non-mineralized rock.

Information synthesis produces either a set of processed (enhanced, quantified, integrated) geological, geochemical, geophysical fields, or a single synthesized index characterizing the favorability/probability of mineral occurrence. Based upon the derived grids, maps, or images, all of which are commonly referred to as 'layers', mineral exploration targets can be delineated by overlaying or combining the different layers. Since the synthesized results, however, are generally continuous, some threshold values are necessary to define the boundaries of targets. The traditional approaches to determine the boundaries are generally subjective and tend to introduce too many uncertainties. Obviously, a precise definition of a target is an important exploration problem to be solved.
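The role of the threshold can be made concrete with a small sketch that applies several candidate thresholds to a synthesized favorability grid and reports, for each, the fraction of the area delineated and the fraction of known occurrences captured. The grid values, occurrence locations, and candidate thresholds are hypothetical, and this comparison is only one possible way to reduce the subjectivity of the choice.

```python
# Hypothetical synthesized favorability grid (rows x columns), values in [0, 1].
favorability = [
    [0.10, 0.20, 0.65, 0.70],
    [0.15, 0.55, 0.80, 0.60],
    [0.05, 0.10, 0.30, 0.20],
]
# Hypothetical (row, column) cells that contain known mineral occurrences.
known_occurrences = [(0, 3), (1, 2), (2, 0)]


def delineate(grid, threshold):
    """Boolean target mask: True where favorability reaches the threshold."""
    return [[value >= threshold for value in row] for row in grid]


def summarize(grid, occurrences, threshold):
    """Return (fraction of area delineated, fraction of occurrences captured)."""
    mask = delineate(grid, threshold)
    n_cells = sum(len(row) for row in mask)
    area_fraction = sum(sum(row) for row in mask) / n_cells
    captured = sum(mask[r][c] for r, c in occurrences) / len(occurrences)
    return area_fraction, captured


for threshold in (0.3, 0.5, 0.7):
    area, captured = summarize(favorability, known_occurrences, threshold)
    print(threshold, round(area, 3), round(captured, 3))
```

Raising the threshold shrinks the delineated area; a threshold is attractive when it does so without losing known occurrences.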

Delineation of potential mineral targets has been a central task especially in the earlier phases of a mineral exploration program. Target areas have been identified by either subjective or objective analysis. Subjective methods provide opportunity for the maximum use of genetic theories of ore deposits and connect genetic knowledge and geological observations either intuitively by expert geologists or formally by a computer system (Harris and Carrigan 1981; Finch and McCammon 1987; McCammon 1990; Koch and Papacharalampos 1988). Subjective methods have been generally formulated as follows: (i) formulate genetic models, (ii) relate geological observations to genetic processes, and (iii) estimate subjective probabilities of mineral occurrence. Objective (mathematical) methods attempt to maximally use various existing mineral occurrence data and quantified geological variables (Botbol et al. 1978; Chung and Agterberg 1980; Agterberg 1988; McCammon et al. 1983; Singer and Kouda 1988). An objective approach generally consists of three major steps: (i) quantification of geological variables, (ii) estimation of mathematical models, and (iii) extrapolation of the estimated models to identify target areas.

Ore genesis models are crucial in mineral exploration and resource evaluation. Since genetic models of ore deposits are usually constructed on the basis of man's past experience, imagination, and logical inference, they have a natural connection to subjective probability analyses and expert systems, giving such an approach great potential for prediction. However, in practice this approach also is subject to some limitations. First, expert systems are costly to build and to validate; second, the full potential of such systems requires the construction and incorporation of extensive data bases. Without such data bases, estimates may be associated with large uncertainties. Furthermore, genetic models change as knowledge is acquired and geologists often disagree on at least some points of a genetic model; this creates uncertainty about the identification of mineral targets. An obvious advantage of objective methods is the production of relatively robust estimates of mineral potentials by extensively using geological, geochemical, and geophysical data. However, these methods also are deficient in some regards. Without using genetic theories, geoscience information content of the variables may be low and may have poor predicting power, i.e., the estimates often 'at best' reproduce what an expert geologist had recognized.

A useful procedure as a link between the two types of model is outlined as follows. First, based upon genetic theories, identify one or more critical genetic factors which are considered as necessary conditions for ore formation. A mineral deposit is believed to be absent if these genetic factors do not exist. Second, identify a set of recognition criteria that offer 'almost sure' existential evidence for critical genetic factors. Third, estimate the favorabilities or probabilities of occurrence of these recognition criteria based upon multiple geodata sets. Fourth, generate a synthesized favorability or probability measure for the occurrence of critical genetic factor(s) based upon the probabilities estimated in the third step. Finally, potential exploration targets are delineated from the synthesized favorability or probability measure through optimum discretization (Pan and Harris 1990). These targets have been referred to as *intrinsic geological units* with respect to the chosen critical genetic factor(s) (Pan and Harris 1993). These targets are so-called chiefly because they are not delineated directly in terms of mineral deposits, but in terms of the critical genetic factor that is a necessary condition for formation of the mineral deposits.

Upon the completion of target delineation, a decision needs to be made as to which targets should receive high priority to be drilled, as different targets vary in the degrees of favorability of mineral occurrence. This need requires the ranking of the targets in the sequence of drilling plans. Rank estimates may be derived directly from the synthesized fields or index. When a reasonable amount of known information on the metal(s) of interest is available in the study region, the rank estimation can be substantially improved by using a functional relation between the synthesized index and the quantity of metal. Of course, estimation of metal quantities is a difficult task, if not impossible. Such a function for estimation of metal quantities is valid only in a sense of pseudo terms, meaning that the results are meaningful only in a statistical sense. Verification for the results is necessary in later stages of exploration and estimation.

### **21.8 Prediction with Dynamic Control Samples**

Most conventional resource analyses are constructed on the basis of extrapolation of some mathematical relations established in control areas into unknown areas (Pan and Harris 2000). Control areas are commonly employed in geodata integration and for the estimation of mineral resources of a relatively unexplored region. As such estimation is predicated upon the principle of *analogy*, the properties of the estimates strongly reflect (1) how good a geological analogue the control area is of the unexplored region and (2) the economic reference for the estimated resources. When the control area is a good analogue and the desired resource estimate is for economic and technologic conditions similar to those that induced the exploration and resource development of the control area, resource estimates produced by a mathematical model estimated on the control area may be unbiased. However, when economic or technologic references for the estimates differ or when the control area is not a good geologic analogue, resource estimates are biased and may even be totally wrong.

Two different approaches to improving estimation by mathematical models estimated on control areas are: (1) use only control areas that are exhaustively explored and (2) extend the mathematical model to include exploration variables (such as those defined in Pan and Harris 1991). Both of these solutions present difficulties, however: (1) except for very small regions, there are few regions large enough to make good control areas that are exhaustively explored and (2) information on exploration activities generally is not available for regions large enough to make good control areas. When exploration variables are not explicitly included in the model, identification of an appropriate control area presents a difficult problem, for it must represent an unbiased sample of deposit occurrence and nonoccurrence for the relevant geologic environment. As noted by Chung et al. (1992), to compute unbiased estimates of the probability for deposit occurrence conditional upon a set of geologic attributes, it is necessary to know not only the distribution of various attributes in and near mineral deposits, but also the distribution of the same attributes away from mineral deposits (Cox 1990; Agterberg 2015).

Given the issues presented above, it is necessary to resolve the dilemma in the selection of control areas and in the method of extrapolating from these control areas into unknown regions. Control areas so far have been static, meaning that they remain fixed when a mathematical model established from them is extended into unexplored regions. Clearly, such a static model is hardly adequate for prediction over a large region with complex variability of geological conditions and mineralization characteristics. In other words, a mathematical model built on samples collected from a control area is appropriate only when the extrapolated areas have geological conditions identical to those in the control area; it is invalid when the geological conditions in the estimated areas differ from those in the control area. Hence, a new concept is proposed here: dynamic control areas, which are characterized by self-improvement of the mathematical models through information gained from extrapolated areas away from the initial control areas. The methodology of dynamic control areas and extrapolation of mathematical models is implemented in three steps as follows:

(1) Select the best explored areas in the working region as the initial control area, from which control samples are collected. On the basis of this sample data, a mathematical model is established through data enhancement, combination of different datasets, and techniques of information synthesis. This mathematical model is then used as the initial model for extrapolation and prediction of unknown areas in the working region.


The model update above is in nature an iterative process, which improves predictability of the model in the unknown units. The initial control sample is only used for establishment of the initial mathematical model, which is then updated and optimized as it is extended into the predicted areas through incorporation of new information on the variability of geological environments.

**Acknowledgements** The author wishes to thank Dr. D. P. Harris for his guidance on the subject, and Dr. Frits Agterberg of the Geological Survey of Canada and Dr. B. S. Daya Sagar of the Indian Statistical Institute-Bangalore Centre for their useful comments.

### **References**



### **Chapter 22 Solving the Wrong Resource Assessment Problems Precisely**

**Donald A. Singer**

**Abstract** Samples are often taken to test whether they came from a specific population. These tests are performed at some level of significance (α). Even when the hypothesis is correct, we risk rejecting it in α percent of the cases, a Type I error. We also risk accepting it when it is not correct, a Type II error with probability β. In resource assessments much of the work is balancing these two kinds of errors. Remarkable advances in the last 40 years in mathematics, statistics, and computer sciences provide extremely powerful tools to solve many mineral resource problems. It is seldom recognized that perhaps the largest error, a third type, is solving the wrong problem. Most such errors are a result of the mismatch between information provided and information needed. Grade and tonnage or contained metal models can contain doubly counted deposits reported at different map scales with different names, resulting in seriously flawed analyses because the studied population does not represent the target population of mineral resources. Among examples from mineral resource assessments are providing point estimates of quantities of recoverable materials that exist in Earth's crust. What decision is possible with that information? Without conditioning such estimates with grades, mineralogy, remoteness, and their associated uncertainties, costs cannot be considered, and possible availability of the resources to society cannot be evaluated. Other examples include confusing mineral occurrences with rare economically desirable deposits. Another example is researching how to find the exposed deposits in an area that is already well explored whereas any undiscovered deposits are likely to be covered. Some ways to avoid some of these type III errors are presented. Errors of solving the wrong mineral resource problem can make a study's value negative.

**Keywords** Quantitative resource assessment ⋅ Decision analysis ⋅ Uncertainty ⋅ Lognormal

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_22

D. A. Singer (✉)

10191 N. Blaney Ave., Cupertino, CA 95014, USA e-mail: singer.finder@comcast.net

### **22.1 Introduction**

Howard Raiffa (1968, p. 264) noted that statistics students learn the importance of constantly balancing making an error of the first kind (that is, rejecting the null hypothesis when it is true) and an error of the second kind (that is, accepting the null hypothesis when it is false) (Fig. 22.1). Raiffa thought it was John Tukey who suggested that practitioners all too often make errors of a third kind: solving the wrong problem. Raiffa nominated a candidate for the error of the fourth kind: solving the right problem too late. John Tukey believed that it was better to find an approximate answer to the right question than the exact answer to the wrong question, which can always be made precise. More recently, Mitroff and Silvers (2009) focused mostly on social questions where type III errors occurred and provided many examples of developing good answers to the wrong questions. Unfortunately, the concerns of Raiffa, Tukey, Mitroff, Silvers, and others are appropriate for mineral resource assessments. And the concerns should not be limited to classical statistics.

Supply of minerals to society is dependent not only on the total amount of mineral material but also on quality or concentrations, spatial distributions or how scattered the material is, whether it has been found, whether it is remote from infrastructure, and a whole host of other issues such as government policies, production technologies, and market structures. Decision-makers, whether concerned about development of a technology, development of a region, exploration, or land management, are faced with the dilemma of obtaining new information, or allowing or encouraging others to obtain it, and the possible benefits and costs of development if mineral deposits of value are discovered. Decisions about exploration for these resources and their possible development require awareness of various kinds and the import of errors that can be made by analysts in their studies.

**Fig. 22.1** Type I error is the rejection of the null hypothesis (Ho) when it is true. The risk of this is α, the level of significance. Type II error is the acceptance of the null hypothesis when it is false

A type I error is the rejection of the null hypothesis when it is true. In some fields a type I error is called a false positive. The risk of this error is α, the level of significance. A type II error is the acceptance of the null hypothesis when it is false, also known as a false negative error. The probability of making a type II error, β, depends on the alternative value and its distribution. The most important question of the analyst and decision-maker should be: Are we solving the right problem? It is the need to consider this source of error in mineral resource studies that is the focus of this chapter. Common to many of the errors of solving the wrong problem is a mismatch of the studied population and the population that is central to the decisions; this topic is presented first. Next, effects of mismatches of populations on some mineral resource assessments are discussed. Finally, possible ways to avoid some of these type III errors are presented.
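As a small numerical illustration of α and β as defined above, the following Python sketch assumes a one-sided test of a normal mean with known standard deviation. The function name and all numbers (hypothesized means, standard deviation, sample size) are hypothetical and chosen only for illustration; they do not come from any assessment.

```python
from statistics import NormalDist


def type_ii_error(mu0, mu1, sigma, n, alpha):
    """Return (critical value, beta) for a one-sided z-test of H0: mean = mu0
    against the specific alternative mean = mu1 > mu0, with known sigma."""
    se = sigma / n ** 0.5  # standard error of the sample mean
    # H0 is rejected when the sample mean exceeds this critical value,
    # which happens with probability alpha when H0 is true (type I error).
    critical = NormalDist(mu0, se).inv_cdf(1.0 - alpha)
    # beta: probability that the sample mean falls below the critical value
    # (H0 accepted) when the true mean is mu1 (type II error).
    beta = NormalDist(mu1, se).cdf(critical)
    return critical, beta


# Hypothetical numbers, for illustration only.
critical, beta = type_ii_error(mu0=0.0, mu1=1.0, sigma=2.0, n=16, alpha=0.05)
print(f"critical value = {critical:.3f}, beta = {beta:.3f}")
```

Changing `mu1` shows how β depends on the alternative value: the further the alternative lies from the null value, the smaller β becomes for the same α. Neither number says anything about whether the hypothesis being tested is the right one to test.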

### **22.2 Target Population**

Type III errors are fundamental and should be considered before errors of types I and II. Type III errors stem from improper definition of the problem and therefore are not strictly a statistical issue, but one of critical thinking. It does no good to minimize the expected costs of type I and type II errors if the wrong problem is being solved. In mineral resource assessments, careless problem definition is the primary source of type III errors. For almost all resource assessment problems, the fundamental sample is the mineral deposit.

The idea of a mineral resource involves both geologic and economic aspects and, because knowledge about the earth and future economic conditions is limited, any estimate of resources should recognize uncertainty. Mineral deposits are the geologic entities containing resources. Mineral deposits and their contents are the fundamental target populations that are estimated. So what is a mineral deposit? Mineral deposits are defined as mineral occurrences of sufficient size and grade that they might, under favorable circumstances, be economic.

A map of some volcanogenic massive sulfide deposits from Northern Japan is used to clarify our understanding of what a deposit is (Fig. 22.2). From this plot one can see that some of the deposits are just a few meters apart from each other. Grades and tonnages are available for 23 of these named deposits from the western part of the Hokuroku district, Japan (Ohmoto and Takahashi 1983). Importantly, if a different map scale were used, this part of the district might have three or four named deposits with grades and tonnages. This well-studied district has more detailed maps than many other volcanogenic massive sulfide districts around the world. If one gathered all available data on the names and grades and tonnages of volcanogenic massive sulfide deposits and built grade and tonnage or contained metal models, the models would contain metals double counted from deposits reported at different map scales and from the same deposits with different names due to grouping. To have a consistent sampling unit that can be applied in statistical analysis and in assessments of undiscovered deposits it is necessary to have spatial rules to help define a deposit.

**Fig. 22.2** Kuroko volcanogenic massive sulfide deposits of the western part of the Hokuroku district in Northern Japan (after Ohmoto and Takahashi 1983)

In addition, mine names and deposit names do not always match, mine names sometimes change over time, and districts and deposits can be reported with different names and numbers. For example, careless data gathering might include the grades and tonnage of the total Sudbury Ni-Cu District in Canada and also the grades and tonnages of its many mines, thus double counting and generating biased metal statistics and frequency distributions of questionable value. There are databases in which spatial rules for combining adjacent deposits have been consistently applied and multiple names have been eliminated (e.g., Mosier et al. 2009). Compilations that use the above sources combined with other sources of data on, for example, volcanic-hosted massive sulfide deposits very likely contain deposits and prospects counted twice (e.g., Patiño-Douce 2016), resulting in statistical analyses that are seriously flawed because the studied population does not represent the target population of mineral resources. Operational rules defining deposits need to account for these map scale effects and for the fact that some deposits have multiple names, mines and separate reported tonnages (Singer 2017).
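The effect of double counting on summary statistics can be seen in a toy calculation. The Python sketch below is only an illustration under invented assumptions: the mine names and tonnages are hypothetical, and the "district" record is simply the sum of the individual mines entered as one more deposit.

```python
from statistics import median

# Hypothetical tonnages (million tonnes); names and values are invented.
mines = {"Mine A": 2.0, "Mine B": 5.0, "Mine C": 1.0, "Mine D": 12.0}

# A careless compilation also records the district as if it were a deposit.
district_total = {"District X (all mines)": sum(mines.values())}

clean = list(mines.values())
double_counted = clean + list(district_total.values())

# Total tonnage is doubled because every mine is counted twice.
print("total tonnage:", sum(clean), "vs", sum(double_counted))
# The median of the tonnage distribution shifts as well.
print("median tonnage:", median(clean), "vs", median(double_counted))
```

Even in this tiny case both the total and the shape of the tonnage distribution change, which is why consistent spatial rules for defining a deposit matter before any model is fitted.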

Mineral occurrences or prospects, which are the focus of prospectivity analysis, do not qualify as economic mineral deposits because they are typically quite small and incompletely explored. Because estimates of the number of undiscovered deposits must be defined in a way that is linked to the grade-tonnage or contained metal models, estimates of the number of deposits made using flawed grade-tonnage models must also be mismatched with the target population.

### **22.3 Examples of Mismatches in Assessments**

Cases of solving the wrong problem due to mismatches of the target population with the studied or estimated population abound in mineral resource assessments. Examples of mismatches include issues of not understanding where the undiscovered resources might exist and estimating something other than mineral deposits that might be economic to mine (De Young and Singer 1981).

In one example, five or more epithermal gold vein deposits were estimated at the 90% level but no grade-and-tonnage model was provided, so the estimated deposits could be any size (Singer and Menzie 2010). To provide critical information to decision-makers, a grade-and-tonnage or contained metal model is key, and the estimated number of deposits that might exist must be from the linked grade-and-tonnage frequency distributions. Estimates of number of undiscovered deposits are completely arbitrary unless tied to a grade-and-tonnage or contained metal model that has been defined in a consistent operational manner.

In an unpublished study, four geoscientists made subjective probabilistic estimates of the number of undiscovered hot-spring mercury deposits in a 1:250,000 scale quadrangle in Alaska. They made independent estimates at the 90th, 50th, and 10th percentiles (Table 22.1). The 10th percentile, for example, is the number of deposits for which there is at least a 10% chance that that number of deposits or more exist.
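The percentile convention used for such estimates can be made concrete with a short Python sketch. It assumes a discrete probability distribution for the number of undiscovered deposits; the distribution and the function name are hypothetical and are not the estimates of Table 22.1.

```python
def estimate_at_level(pmf, level):
    """Largest n such that P(N >= n) >= level.

    pmf maps a number of deposits n to its probability; the probabilities
    are assumed to sum to 1.
    """
    best = 0  # P(N >= 0) is always 1, so 0 satisfies any level
    for n in sorted(pmf):
        exceedance = sum(p for k, p in pmf.items() if k >= n)
        if exceedance >= level - 1e-12:  # small tolerance for rounding
            best = n
    return best


# Hypothetical distribution of the number of undiscovered deposits.
pmf = {0: 0.05, 1: 0.25, 2: 0.30, 3: 0.20, 5: 0.15, 8: 0.05}

n90, n50, n10 = (estimate_at_level(pmf, q) for q in (0.90, 0.50, 0.10))
print(n90, n50, n10)
```

Under this convention the estimates never decrease as the probability level drops from 90% to 10%, which gives a simple consistency check on any set of elicited numbers.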

It was pointed out to participant D that because estimates of the number of deposits must be consistent with the grade and tonnage model, his estimates imply that there is more undiscovered mercury in this quadrangle than has been found in the world in this deposit type. He responded that he was estimating wisps of cinnabar, not deposits consistent with the grade and tonnage model. In this case, the population considered by participant D did not match the target population. Using a variety of different guidelines such as deposit densities (Singer 2008) for estimates of the number of undiscovered deposits provides a useful crosscheck of assumptions that may have been relied upon and discourages mismatches between target and estimated populations. In these examples of errors in estimating the number of undiscovered deposits, the key is the difference between the understanding of what was being estimated and the population of interest.

**Table 22.1** Independent estimates by four scientists of the number of undiscovered hot-spring Hg deposits in a quadrangle in Alaska

In Harris's landmark study (1965), multiple discriminant analysis was used to predict value of mineral production; among the best predictors was geologic cover with a negative value. In a study by Singer (1971), multiple regression was used to predict mineral production and again, cover with a negative value was an important variable. Unlike in petroleum exploration, minerals exploration under cover is a developing technology. Most commonly, mineral exploration under cover results from trying to extend known deposits, that is, additions to reserves. More difficult discovery and higher costs relative to exposed deposits tend to reduce interest in covered areas. Covered areas tend to be poorly explored and, consequently, deposits under cover tend to be underreported.

In situations where resource assessments are made based on local information, the possibility of solving the wrong problem is high. For example, if the mapped geology were used to predict where and how many undiscovered orogenic gold deposits might exist in the Bendigo Zone of Victoria, Australia, one would conclude that deposits are clustered in space, that gold deposits are related to older rocks, and that covered areas would be the worst place to look (Fig. 22.3). Even if we use some modern tools like weights of evidence or neural networks, we would predict no undiscovered deposits under cover. Yet, because geology permissive for the gold deposits is known under cover, and exposed permissive geology is thoroughly explored, most experts would recommend exploration under cover (Lisitsin et al. 2007).

Each of these examples demonstrates mismatches of the target population and the studied population. Type III errors in these cases could produce useless or, even worse, misleading assessments.

**Fig. 22.3** Geology and known orogenic gold deposits (black) in the Bendigo Zone of Victoria, Australia (modified after Lisitsin et al. 2007)

### **22.4 How to Correct Type III Errors**

The problems of mineral resource assessment can only be solved if they are formulated in a way consistent with the decision-maker's language and understanding of the problem. The questions need to be asked: Why perform an assessment? Who is the study being done for and what are the problems they are trying to resolve?

We start with the question of what kinds of issues decision makers are trying to resolve and what types and forms of information would aid in resolving these issues. Unfortunately, the decision-maker may not be available for the needed insight or may not be able to clearly state the information needs. Because the primary purpose of the kinds of assessments recommended here is to help decision-makers determine consequences of economic and policy decisions about tracts of land, regions, countries, or the earth, it is critical that the assessments be unbiased. For example, if the question concerns the long-term supply of a metal, the data used should not contain biased information such as grades and tonnages on multiple versions of the same deposits. These situations require care in compiling data and using sources that report locations, other names of deposits and names of deposits that have been combined with the primary deposit to meet spatial combination rules. A reliable source (e.g., Mosier et al. 2009) has specific information about locations, rules used to combine deposits and specific names that were combined for each deposit. These kinds of data provide a reliable basis for testing statistical distributions of metals in mineral deposits such as the lognormal distribution (Singer 2013).

It is important to recognize that success of assessments depends on the assessments following an integrated approach. This means that no part of the models and methods of estimation has any meaning in isolation. For instance, estimates of number of undiscovered deposits are completely arbitrary unless tied to a grade and tonnage or contained metal model. The goal should be to make explicit the factors that can affect a mineral-related decision so that the decision-maker can clearly see the possible consequences of decisions (Singer and Menzie 2010).

To avoid situations where occurrences are the basis of information used to discriminate barren areas from the economic deposits sought, it is necessary to construct models based on the economic deposits sought. Mineral deposit models can be based on data gathered from well-explored deposits of each type from around the world. This would allow the determination of how commonly different attributes and combinations of attributes occur. Quantifying mineral deposit attributes is the necessary and sufficient next step in statistically classifying known deposits by type. Quantified deposit attributes also can provide a firm foundation to identify which observations on geologic and other maps should be effective in delineation of tracts and perhaps identifying sites for detailed exploration. The kind of digital models advocated here would require the recording of both absolute time units and the relative time units of spatially related mineral deposits, rocks, geochemistry, geophysics, and tectonics. The scale of the observations is critical to proper application of such models. This is required to properly apply the models in new geologic settings. Information in these models about the attributes associated with known deposits is necessary but not sufficient to discriminate barren from mineralized environments; quantifying the attributes of barren environments also is necessary for this task. Such digital models could be the foundation for identifying the discriminating functions that could remove many type III errors in assessments.

The exploration department of a major zinc producer found it essential to document a robust decision-making process to maintain internal and investor support (Penney et al. 2004). Zinc deposits from around the world were classed by type, grade and tonnage models were developed for each type, cost filters were applied to each, and tracts around the world were delineated where the types could occur (Penney et al. 2004). This study was designed to help the exploration decision-makers plan the search for economic deposits. Their process was the same as that recommended in three-part assessments (Singer and Menzie 2010), with the exception that they ranked or scored tracts rather than estimating the number of undiscovered deposits.

### **22.5 Conclusions**

Errors of solving the wrong mineral resource problem can make a study's value negative. Type III errors, solving the wrong problem, can be avoided by using care in matching the information needed to solve the decision-maker's problem with information provided in the study. In some cases, we know how to solve the wrong problem but not the real one. It is not uncommon to get rewarded for publishing an answer—not THE answer. With some care and critical thinking in the planning stages, it is possible to provide information useful to decision-makers and to be rewarded for a publication.

### **References**



### **Chapter 23 Two Ideas for Analysis of Multivariate Geochemical Survey Data: Proximity Regression and Principal Component Residuals**

**G. F. Bonham-Carter and E. C. Grunsky**

**Abstract** Proximity regression is an exploratory method to predict multielement haloes (and multielement 'vectors') around a geological feature, such as a mineral deposit. It uses multiple regression directly to predict proximity to a geological feature (the response variable) from selected geochemical elements (explanatory variables). Lithogeochemical data from the Ben Nevis map area (Ontario, Canada) is used as an example application. The regression model was trained with geochemical samples occurring within 3 km of the Canagau Mines deposit. The resulting multielement model predicts the proximity to another prospective area, the Croxall property, where similar mineralization occurs, and model coefficients may help in understanding what constitutes a good multielement vector to mineralization. The approach can also be applied in 3-D situations to borehole data to predict presence of multielement geochemical haloes around an orebody. Residual principal components analysis is another exploratory multivariate method. After applying a conventional principal components analysis, a subset of PCs is used as explanatory variables to predict a selected (single) element, separating the element into predicted and residual parts to facilitate interpretation. The method is illustrated using lake sediment data from Nunavut Territory, Canada to separate uranium associated with two different granites, the Nueltin granite and the Hudson granite. This approach has the potential to facilitate the interpretation of multielement data that has been affected by multiple geological processes, often the situation with surficial geochemical surveys.

G. F. Bonham-Carter (✉)

110 Aaron Merrick Drive, Merrickville, ON K0G 1N0, Canada e-mail: graeme.bc1@gmail.com

E. C. Grunsky

Department of Earth and Environmental Sciences, University of Waterloo, Waterloo, ON N2L 3G1, Canada

**Keywords** Ben Nevis area ⋅ Nunavut ⋅ Multivariate geochemistry ⋅ Lake sediment surveys ⋅ Regression ⋅ Principal components ⋅ Spatial modelling ⋅ Proximity ⋅ Residuals

### **23.1 Introduction**

Proximity to selected spatial features on geological maps has been used in the analysis of multivariate data in several ways, but usually as a weighting function, not as a variable to be directly predicted. For example, Cheng et al. (2011) describe "spatially weighted principal component analysis" to emphasize proximity to selected intrusions in the analysis of geochemical patterns. This involves using spatial weights (in the range 0–1) to calculate weighted correlation coefficients, before the usual eigenvector determinations of principal components analysis. The resulting weighted principal component scores were mapped to predict element associations related to intrusions. Brunsdon et al. (1998 and other papers) have used "geographically weighted regression" to analyze long-term illness data from a UK census. This approach recognizes that a regression may often not be spatially stationary, but will show changes geographically. Again, the regression equations use spatial variables as weights. In both these examples, proximity to some feature is introduced as a spatial weight, not as a response variable for direct prediction.

In the first part of this chapter we suggest that proximity to a geological feature can be more directly studied by using proximity itself as a response variable in a regression using a collection of geochemical elements as explanatory variables. In regional geochemical surveys, one may be interested in understanding which variables are good predictors of proximity to a mineral deposit, or to some other selected feature with known location. This is frequently referred to in mineral exploration as finding good 'vectors' to mineralization, but as far as we are aware direct prediction of proximity from multielement data has not been published, although plots of single elements, or element ratios, on profiles showing distance to known mineralization are often used. If a good predictive suite of elements can be determined (either from understanding a genetic model or from empirical tests) and based on a training set of samples relatively close to the geological feature of interest, the resulting predictive equation can be used to look for similar associations outside the training area. If the feature of interest is a mineral deposit, this approach may be useful in finding new deposits. This may be used both for 2-D regional geochemical surveys, and in 3-D geochemical data from borehole data.

The second part of the chapter is about using residual principal components analysis (PCA) of multielement geochemical data. PCA has been widely used by exploration geochemists and others to understand multielement geochemical processes, particularly in surficial geochemical surveys, but also in lithogeochemical data collected at surface or in boreholes. This literature is large, and here we refer as an example to a study of soil geochemistry as measured along two continental scale transects of North America. PCA of logratio-transformed variables revealed the effects of soil-forming processes, including soil parent material, weathering, and soil age as interpreted from PCs (Drew et al. 2010). There are many examples of successful geological interpretations by PC analysis. Individual PCs can often be interpreted both from variable loadings, from biplots and from spatial patterns seen by mapping PC scores (e.g. Grunsky 2010).

Sometimes, however, one may be interested in the spatial distribution of a single geochemical element, and it is desirable to remove the effect of some particular geological process or processes that are reflected in one or more PCs. For example, in the analysis of till geochemical surveys, the first PC is often interpreted as due to the effect of till transport. Thus it may be desirable to look at the element distribution after removing PC1. Usually this is carried out by progressively examining element loadings and the spatial patterns seen by mapping PC scores. However, there may be situations where it is helpful to examine spatial patterns of a single element after removing PC1 (or several PCs). This can be achieved by what we term here "principal component regression". This is a straightforward regression using the selected element as the response variable, and PC1 (or a combination of PCs) as the explanatory variable(s). The residuals (the observed response variable minus the predicted response variable) provide the desired element distribution after removing the effect of PC1 (or the PC combination). If PC1 is interpreted as due to till transport, then the residuals represent the element values after removing the effect of till transport.
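The regression of a single element on selected PCs can be sketched in a few lines of Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `pc_residuals` is hypothetical, the PCA is computed from the covariance matrix of the supplied columns (which in practice would already be logratio-transformed), and the data are synthetic.

```python
import numpy as np


def pc_residuals(X, element, n_pcs=1):
    """Regress one column of X on the first n_pcs principal component scores.

    Returns (predicted, residual) parts of that column, where
    residual = observed - predicted.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                     # centre each variable
    cov = np.cov(Xc, rowvar=False)              # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # largest variance first
    scores = Xc @ eigvecs[:, order[:n_pcs]]     # PC scores, shape (n, n_pcs)

    y = X[:, element]                           # selected element (response)
    A = np.column_stack([np.ones(len(y)), scores])  # intercept + PC scores
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # ordinary least squares
    predicted = A @ coef
    return predicted, y - predicted


# Synthetic illustration: five variables sharing one common factor.
rng = np.random.default_rng(0)
factor = rng.normal(size=200)
X = np.outer(factor, [1.0, 0.8, 0.6, 0.4, 0.2]) + 0.3 * rng.normal(size=(200, 5))

predicted, residual = pc_residuals(X, element=0, n_pcs=1)
print("std of observed:", X[:, 0].std(), "std of residual:", residual.std())
```

Mapping `residual` rather than the raw element values corresponds to looking at the element after the effect attributed to PC1 has been removed; passing a larger `n_pcs` removes a combination of PCs instead.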

This approach represents a process that is somewhat analogous to a geochemical selective leach separating a mineral phase or perhaps several mineral phases. A 'total' analysis is designed to dissolve all mineral phases, whereas a partial leach targets a selected mineral phase. The element under study can thereby be partitioned into phases by selective leaching. Residual PCA also separates the element under study into parts, although the partitions are not the same as those targeted in selective leaches. The partitions in residual PCA are related to proportions of an element quantity that can be 'explained' by different multivariable associations as determined by PCA. Residual PCA was first used by Bonham-Carter and Hall (2010) in a study of uranium in soils in the Athabasca Basin. Residual U, after removing the effect of till transport (as determined by PCA), was a better predictor of buried mineralization than raw U values in A-horizon soils.

In this chapter, we use a lithogeochemical dataset from the Ben Nevis area of Ontario to illustrate proximity regression, and a lake-sediment dataset from southern Nunavut to illustrate residual principal components analysis.

### **23.2 Method 1: Direct Prediction of Spatial Proximity**

Suppose we have an array of geochemical data, with rows being samples, and elements as columns. In addition, we have distance measurements for each sample reflecting the shortest distance from the sample to some geological feature (mineral deposit, an intrusion, a fault, etc.). Before multivariate analysis, it will be important to transform the element variables by a centred logratio, to overcome the effects of closure (Aitchison 1986; Buccianti et al. 2006; and many other papers).
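As a minimal sketch of the centred logratio (CLR) step, the function below takes one sample whose parts are all in the same unit and subtracts the mean of the logs (the log of the geometric mean) from each log value. The function name `clr` and the three-part sample are hypothetical and serve only as an illustration.

```python
import math

def clr(composition):
    """Centred logratio transform of one sample (all parts must be > 0)."""
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)   # log of the geometric mean
    return [value - mean_log for value in logs]

# Hypothetical three-part sample, all parts in the same unit (ppm).
sample_ppm = [120.0, 35.0, 8.5]
sample_clr = clr(sample_ppm)

# CLR values of one sample always sum to zero (up to rounding).
print([round(v, 3) for v in sample_clr], round(sum(sample_clr), 12))
```

Because only ratios between parts enter the result, rescaling the whole sample (for example from ppm to weight percent) leaves the CLR values unchanged.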

Although distances may be used directly, we have found that transforming distance to proximity gives somewhat better predictions. If for example the goal is to model the dispersion 'halo' around a deposit, the decay of the halo effect with distance from the contact may be exponential, or may follow a power law. Thus, instead of using distance as a response variable, we often get better results by transforming distances inversely to proximities. Here we have used a simple exponential decay of proximity with distance, which assumes that the relative rate of decay of proximity with distance is constant, similar to the familiar model of decay of a radioactive element with time. Let distance be denoted as *Z* (metres from feature) and proximity by *Y* (ranging from 1 at zero distance down to 0 at infinitely large distances); then the rate of decay of proximity with distance is assumed to be proportional to proximity itself, with decay constant α:

$$\frac{\mathrm{d}Y}{\mathrm{d}Z} = -\alpha Y.\tag{23.1}$$

Integrating (23.1) from distance 0 to *Z* leads to:

$$Y(Z) = Y(0)\,e^{-\alpha Z}.\tag{23.2}$$

The value of proximity at zero distance is *Y*(0) = 1, so this term drops out. It is also convenient to define the 'half-distance' *Z*<sub>0.5</sub> at which proximity *Y* equals 0.5; then, by rearranging Eq. 23.2, we can express α in terms of the half-distance:

$$\alpha = \frac{-\ln 0.5}{Z_{0.5}}.\tag{23.3}$$

Substituting for α in (23.2), distance can then be transformed to proximity from

$$Y(Z) = \exp\left(\frac{\ln 0.5}{Z_{0.5}} \cdot Z\right).\tag{23.4}$$

We note that an alternative approach was used by Cheng et al. (2011) in the spatially weighted principal components to determine spatial weights W (equivalent to proximities) using a power relation:

$$W = \left(1 - \frac{Z}{Z_{\max}}\right)^{\gamma} \tag{23.5}$$

where γ is a power parameter, and *Z*<sub>max</sub> is a selected maximum distance for modelling. For γ = 0, all weights equal 1; for γ = 1, weights decrease linearly with distance; and larger positive values of γ, such as 2, 8 or 16, define a progressively steeper power-law decrease of proximity with increasing distance.

**Fig. 23.1** Left. Example of relationship between proximity and distance using exponential decay with a 'half-distance' parameter. Proximity = 1 at distance = 0, proximity = 0.5 at distance = half-distance. Right. Similar to left diagram, but using power law model with gamma parameter

Typical exponential curves and power law curves using Eqs. 23.4 and 23.5 are shown in Fig. 23.1.
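The two distance transforms can be sketched in a few lines. The example below assumes that Eq. 23.5 is read as W = (1 − Z/Z<sub>max</sub>)<sup>γ</sup>, and that samples beyond the maximum distance are given a base of zero; the function names and the tabulated distances are hypothetical.

```python
import math

def exponential_proximity(distance, half_distance):
    """Eq. 23.4: proximity is 1 at zero distance and 0.5 at the half-distance."""
    return math.exp(math.log(0.5) * distance / half_distance)

def power_law_weight(distance, max_distance, gamma):
    """Eq. 23.5 (as read here): weight is 1 at zero distance, 0 at max_distance."""
    # Assumption: distances beyond max_distance are clamped to a base of zero.
    base = max(0.0, 1.0 - distance / max_distance)
    return base ** gamma

# Tabulate both transforms for a few distances (metres).
for z in (0.0, 400.0, 800.0, 1600.0, 3000.0):
    y = exponential_proximity(z, half_distance=800.0)
    w = power_law_weight(z, max_distance=3000.0, gamma=2.0)
    print(f"Z = {z:6.0f} m   Y = {y:.3f}   W = {w:.3f}")
```

The exponential form never reaches zero, whereas the power-law form is cut off at the chosen maximum distance, which is one practical difference between the two curves.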

We now model proximity with a training set of samples (chosen within some arbitrary but reasonable distance from the selected feature) using selected geochemical variables.

Then let *X* be the matrix of CLR-transformed element values, with rows as samples, columns as elements. The geochemical elements are the explanatory variables, and the column vector Y contains the proximity values, the response variable. The geochemistry is used to 'explain' the response. Here we used multiple linear regression to model this relationship, although other approaches could be taken.

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \tag{23.6}$$

where β is a column vector of coefficients to be determined by least squares, and ϵ is the vector of errors. The coefficients are solved from the normal equations

$$\boldsymbol{\beta} = \left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}\left(\mathbf{X}^{\prime}\mathbf{Y}\right) \tag{23.7}$$

where X′ is the transpose of X and (X′X)<sup>−1</sup> is the inverse of X′X.

If inspection of the coefficients and goodness of fit are satisfactory, the predicted values of proximity, Ŷ, are calculated from

$$
\hat{\mathbf{Y}} = \mathbf{X}\boldsymbol{\beta}.\tag{23.8}
$$
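Equations 23.6 to 23.8 can be illustrated with a small, dependency-free sketch. It forms X′X and X′Y, solves the normal equations by Gaussian elimination, and then computes the predicted values. The leading column of ones (an intercept), the two-element toy data and all function names are assumptions made for illustration; in practice a statistics package would normally be used.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]   # augmented copy [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; drop a redundant variable")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):              # back substitution
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x

def fit_least_squares(x_rows, y):
    """Return beta from the normal equations (X'X) beta = X'Y (Eq. 23.7)."""
    p = len(x_rows[0])
    xtx = [[sum(r[i] * r[j] for r in x_rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(x_rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def predict(x_rows, beta):
    """Eq. 23.8: predicted response for every sample."""
    return [sum(xi * bi for xi, bi in zip(row, beta)) for row in x_rows]

# Hypothetical CLR values for two elements; the leading 1.0 adds an intercept.
clr_values = [(0.9, -1.2), (0.4, -0.7), (-0.1, 0.2),
              (-0.6, 0.5), (-0.8, 1.1), (0.2, 0.1)]
X = [[1.0, a, b] for a, b in clr_values]
# Noise-free toy proximities, so the fit should recover the coefficients.
Y = [0.30 + 0.25 * a - 0.10 * b for a, b in clr_values]

beta = fit_least_squares(X, Y)
Y_hat = predict(X, beta)
print([round(b, 6) for b in beta])
```

Solving the system directly avoids forming the explicit inverse of X′X, which is numerically preferable although algebraically equivalent to Eq. 23.7.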

### *23.2.1 Application of Proximity Regression with Ben Nevis Lithogeochemical Data*

#### **23.2.1.1 Background Geology**

The Ben Nevis Township area is part of the Blake River Group (Fig. 23.2), a calc-alkaline volcanic sequence. The same sequence extends eastward to the Noranda area of Quebec where major Cu-Zn-Ag deposits are located. Extensive alteration and mineralization were recognized in the Ben Nevis area (Jensen 1975; Wolfe 1977), which led to a later geochemical study by Wolfe (1977) with emphasis on the metal distribution of stratiform volcanogenic sulphide deposits in Archean volcanic rocks. Lithogeochemical sampling was undertaken across the area by Jensen (1975) and Wolfe (1977), followed by additional sampling by Grunsky (1986a, b). Grunsky and Agterberg (1988) and Grunsky (1986a, b) carried out a detailed multivariate geostatistical investigation of these data. A regional multi-element geochemical study over the Abitibi Greenstone Belt was later undertaken by Grunsky (2013), in which multivariate statistical methods were applied to recognize lithological variation, areas of alteration and potential base-metal mineralization.

The principal lithologies of the study area are basaltic pillowed flows, pillow breccias and breccias of calc-alkaline affinity (Grunsky 1986a). Two felsic volcanic units comprised of tuff, tuff breccia and flows of rhyolitic and dacitic composition occur within the basaltic sequence. The volcanic sequence has been intruded by tholeiitic gabbroic and diorite bodies throughout (Fig. 23.3). More recent studies of the volcanic assemblage in the context of the Abitibi Greenstone Belt are described by Péloquin et al. (2008).

**Fig. 23.2** Location map of Ben Nevis study area adapted from Grunsky (1986a)

**Fig. 23.3** Geology of Ben Nevis area, adapted from Grunsky (1986a). Note locations of Canagau Mines deposit and Croxall property. Figure from Grunsky (1986b)

Within the area, the two most significant mineral occurrences are the Canagau Mines deposit and the Croxall property. The Canagau Mines deposit is dominated by strongly carbonatized, sericitized, and silicified mafic and felsic volcanic rocks. Mineralization consists of sphalerite, gold, silver, galena, chalcopyrite, and pyrite within east-trending fractures and shear zones that dip 40–60° south. Tonnages are unknown, and the grade is as high as 11 ppm gold and 22 ppm silver. The area was extensively explored by Wallbridge Mining in 2004 (Wallbridge 2004), and the exploration activities are reported by Meyer et al. (2004). The deposit is currently considered to be uneconomic. The Au-Ag-Cu-Pb-Zn style of mineralization is typical of an epithermal system.

The Croxall property consists of a zone of brecciated and sheared rhyolite with interstitial pyrite, chalcopyrite, chlorite, calcite and quartz. Gold assays have been reported up to 1 ppm.

Grunsky (1986a, b) showed that multivariable data analysis techniques distinguish the altered from unaltered volcanic rocks.

### **23.2.1.2 Application**

The purpose of this application is to determine whether a multielement signature can be identified related to proximity to the Canagau Mines deposit, then use this signature to look for other places with similar patterns.

The distances between each sample and the Canagau Mines deposit were calculated using the eastings and northings associated with each sample, plus the known location of the deposit. Distances were converted to proximities using Eq. (23.4). Different proximity vectors were calculated for half-distances of 100, 300, 500, 800, 1000 and 1500 m so that an optimal half-distance parameter could be determined. Figure 23.4 shows the sample points with proximity (half-distance equal to 800 m) classified by colour and dot size. The training set comprises all points lying within 3 km of the deposit (equivalent to points with proximity greater than exp(ln(0.5) \* 3000/800) = 0.074).
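The selection of training samples can be sketched as follows. The coordinates are invented for illustration and are not the real deposit or sample locations; the constant names are likewise hypothetical. The sketch computes planar distances from eastings and northings, keeps samples within the 3 km radius, and confirms that the radius corresponds to a proximity threshold of about 0.074 for an 800 m half-distance.

```python
import math

def proximity_from_distance(distance_m, half_distance_m):
    """Exponential proximity of Eq. 23.4."""
    return math.exp(math.log(0.5) * distance_m / half_distance_m)

# Hypothetical UTM coordinates in metres (easting, northing).
deposit = (585000.0, 5355000.0)
samples = {
    "S1": (585300.0, 5355400.0),
    "S2": (587000.0, 5356500.0),
    "S3": (589500.0, 5351000.0),
}

HALF_DISTANCE_M = 800.0
TRAINING_RADIUS_M = 3000.0

training = {}
for name, (easting, northing) in samples.items():
    distance = math.hypot(easting - deposit[0], northing - deposit[1])
    if distance <= TRAINING_RADIUS_M:
        training[name] = proximity_from_distance(distance, HALF_DISTANCE_M)

# Proximity value that corresponds to the edge of the training circle.
threshold = proximity_from_distance(TRAINING_RADIUS_M, HALF_DISTANCE_M)
print(sorted(training), round(threshold, 3))
```

Repeating the proximity calculation with each candidate half-distance would give the set of response vectors that the text compares.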

There are 26 geochemical variables in the dataset, a mixture of trace elements and major oxides. After converting all elements to a common unit of measurement (ppm), all chemical variables were transformed by centred logratios (CLR) to avoid the problem of closure. Using the training samples, correlation coefficients were calculated between each element (CLR-transforms) and proximity. These correlations were sorted by magnitude and used to reduce the number of elements selected to predict proximity by multiple regression analysis. Elements were selected for Model 1 if the absolute value of correlation (Pearson's r) with proximity was greater than 0.2 (Table 23.1). This reduced the number of elements to be used as explanatory variables from 26 to 11.

**Fig. 23.4** Map showing locations of lithogeochemical samples, with size and colour of dots related to proximity to Canagau Mines deposit (Fig. 23.3). Training set for regression model includes only those samples within 3 km of deposit (within circle)

**Table 23.1** Result of multiple linear regression. Variables selected for regression against proximity (Model 1) by selecting those with abs (correlation coefficient) > 0.2. The explanatory variables are CLR-transformed geochemical element values, the response variable is proximity to the Canagau Mines deposit, using n = 278 samples that lie within 3 km of the deposit for training. Variables selected for Model 2 based on p-values < 0.03 from Model 1

CLR variables were not further transformed, and the coefficients and associated probabilities obtained by using Eq. (23.7) are shown in Table 23.1. Note that Co, Li, Pb and CO2 have positive coefficients, whereas Ni, Sr, V, CaO, Na2O, K2O, TiO2 and S have negative coefficients. This model has a goodness-of-fit of about 40% (adjusted R<sup>2</sup> = 0.399). A second model was then run to remove those variables in Model 1 with p-values greater than 0.03. In Model 2, CO2 is the only variable with a positive coefficient, and V, CaO, Na2O and K2O have negative coefficients. The goodness-of-fit of Model 2 is almost the same as that of Model 1, with adjusted R<sup>2</sup> = 0.394. Although not shown here, the predicted values from Model 1 and Model 2 are highly correlated, and maps of each are virtually indistinguishable.
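The adjusted R<sup>2</sup> used to compare the two models can be computed as in the sketch below, which assumes the usual definition for a linear model with an intercept: 1 − (1 − R<sup>2</sup>)(n − 1)/(n − p − 1), where n is the number of samples and p the number of explanatory variables. The observed and predicted values are invented for illustration.

```python
def adjusted_r_squared(observed, predicted, n_explanatory):
    """Adjusted R^2 for a linear model with an intercept."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    ss_resid = sum((y - f) ** 2 for y, f in zip(observed, predicted))
    r_squared = 1.0 - ss_resid / ss_total
    # Penalise the fit for the number of explanatory variables used.
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - n_explanatory - 1)

# Hypothetical observed and predicted proximities for six samples.
observed = [0.9, 0.7, 0.4, 0.3, 0.2, 0.1]
predicted = [0.8, 0.6, 0.5, 0.3, 0.25, 0.15]

adj_r2 = adjusted_r_squared(observed, predicted, n_explanatory=2)
print(round(adj_r2, 4))
```

Because the penalty grows with the number of explanatory variables, a smaller model with nearly the same adjusted value, as reported for Model 2, is usually preferred on grounds of parsimony.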

The predicted values of proximity are shown in Fig. 23.5 for both the training and non-training samples. As expected, the Canagau Mines deposit shows up as a 'bullseye' at the centre of the training sample area. Notice that the Croxall property shows as another less prominent bullseye to the west, in the non-training sample area. Other high values of predicted proximity to the south of the Canagau Mines deposit and northeast of the Croxall property are associated with known sulphide occurrences as shown in Fig. 23.3. Thus, we can conclude that proximity regression led to the selection of a suite of useful explanatory variables that, after training on the Canagau Mines deposit, was able to 'discover' the Croxall property.

**Fig. 23.5** Map showing predicted proximity to Canagau Mines deposit. Plot includes both points used in training (those within 3 km of deposit) and other sample points. Croxall property is identified with large proximity values by this model

**Fig. 23.6** Plot of observed proximity versus predicted proximity, with best fit line, training points only. In general, fit is noisier at lower values of proximity. Points with proximity >0.5 (i.e. within the 'half-distance' of 800 m of the Canagau Mines deposit) show a stronger relationship

**Fig. 23.7** Variation in goodness of fit (adjusted R<sup>2</sup>) with changes in 'half-distance', the parameter used to control rate of exponential decay of proximity with increasing distance (Eq. 23.4). Note that curve shows that relationship is strongest using half-distance parameter = 800 m

A bivariate plot of observed versus predicted proximity, training points only, (Fig. 23.6) shows that the relationship is noisier far away from the deposit than closer to it, consistent with the proximity response weakening at increasing distance.

Experimental results show that an optimum half distance for modelling proximity as an inverse function of distance is 800 m, although the results are not very sensitive to changes in the 300–1000 m range (Fig. 23.7). It is not clear how useful this parameter might be in describing the geometry of the 'halo' effect around the deposit.

### **23.3 Method 2: Principal Component Residuals**

Many geochemical survey data are difficult to interpret, because multiple overlapping processes affect element levels in space and time. In some situations, a principal component will show a composition (based on element loadings) and a spatial pattern reflecting an interpretable geological process, but usually interpretation is complex because of interacting processes.

Residual principal components analysis is an exploratory approach that can sometimes be helpful in sorting out complex multielement interactions. The method is a straightforward extension of applying principal components, followed by a series of multiple linear regressions. As with the proximity regression method, it is important first to carry out a centred log ratio transform of all the elements, otherwise distortions may occur in principal component (and subsequent multiple regression) results due to constant sum 'closure' effects.

Regular PCA is carried out in the usual way on the correlation matrix calculated from CLR-transformed element variables (e.g. Davis 2002, ch. 6).

Inspecting the eigenvectors for each PC, examining biplots, and mapping PC scores for at least the first few PCs can then lead to an interpretation of PCs in terms of geological processes (Grunsky 2010). Here the objective is to focus on a selected element to separate out ('partition') this element compositionally and spatially using the principal component results.

For the element of interest, the next step is to inspect the corresponding row of the eigenvector matrix (the 'loadings') to understand better in which components the element occurs. It may be decided to predict the element from PC1 only, or from PC1 and PC2, or PC1, PC2 and PC3, and so on. For each of these selections, a multiple regression is carried out with the selected PCs as explanatory variables, and the chosen element as the response variable. For example, if the response variable is *V* and the explanatory variables are PCs 1 to PC3, then

$$V = \beta_0 + \beta_1 PC_1 + \beta_2 PC_2 + \beta_3 PC_3 + \epsilon \tag{23.9}$$

can be solved as before for the coefficients β by least squares. If the predicted values of *V* are *V*\*, then the residuals *V*<sub>R</sub> are simply

$$V_R = V - V^{*} \tag{23.10}$$

computed over all sample locations.

The choice of PCs in Eq. (23.9) may be as simple or as complex as needed. We have had good results by successively adding PCs, inspecting the goodness of fit at each stage and mapping the predicted and residual values at each step. Inspection of residual patterns may reveal, spatially, where concentrations of that particular element are distributed, facilitating interpretation.

In this method, there is no training set; calculations are carried out on all samples.
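The residual calculation can be sketched for the simplest case of removing PC1 only. The example below is a simplification under stated assumptions: it standardises the variables (correlation-matrix PCA), finds the leading eigenvector by power iteration instead of a full eigen-decomposition, and regresses the chosen variable on the PC1 scores. The three CLR-like columns and all names are hypothetical; with several PCs the single regression would become a multiple regression as in Eq. 23.9.

```python
import math

def standardise(columns):
    """Centre each variable and scale it to unit variance."""
    out = []
    for col in columns:
        mean = sum(col) / len(col)
        sd = math.sqrt(sum((v - mean) ** 2 for v in col) / (len(col) - 1))
        out.append([(v - mean) / sd for v in col])
    return out

def first_pc_loadings(z_cols, iterations=500):
    """Leading eigenvector of the correlation matrix by power iteration."""
    p, n = len(z_cols), len(z_cols[0])
    corr = [[sum(a * b for a, b in zip(z_cols[i], z_cols[j])) / (n - 1)
             for j in range(p)] for i in range(p)]
    vec = [float(j + 1) for j in range(p)]        # arbitrary non-uniform start
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    return vec

def residuals_after_pc1(columns, target_index):
    """Regress the target on PC1 scores; return (scores, predicted, residual)."""
    z_cols = standardise(columns)
    loadings = first_pc_loadings(z_cols)
    n = len(columns[0])
    scores = [sum(loadings[j] * z_cols[j][i] for j in range(len(z_cols)))
              for i in range(n)]
    target = columns[target_index]
    t_mean = sum(target) / n
    # Scores have zero mean, so the least-squares slope has this simple form.
    slope = sum(s * (t - t_mean) for s, t in zip(scores, target)) / sum(
        s * s for s in scores)
    predicted = [t_mean + slope * s for s in scores]
    residual = [t - f for t, f in zip(target, predicted)]
    return scores, predicted, residual

# Hypothetical CLR-like values for three elements at six sample sites.
u_clr = [0.8, 0.5, 0.1, -0.2, -0.5, -0.7]
th_clr = [0.7, 0.6, 0.0, -0.1, -0.6, -0.6]
k_clr = [-0.3, 0.2, -0.1, 0.4, -0.2, 0.0]

scores, predicted, residual = residuals_after_pc1([u_clr, th_clr, k_clr], 0)
print([round(r, 3) for r in residual])
```

By construction the residuals are uncorrelated with the PC1 scores, which is what "removing the effect of PC1" means in this linear sense.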

### *23.3.1 Application to Nunavut Lake Sediment Data*

### **23.3.1.1 Geological Background**

The lake sediment survey was carried out over three 1:250,000 scale map areas (NTS 65A, 65B, 65C) in southern Nunavut Territory, Canada (McCurdy et al. 2012). The geology of two of the NTS sheets (65A, 65B) was mapped by Eade (1973) and is shown in Fig. 23.8. Of particular interest to this study are two important granitic intrusion types: the Hudson granite (1.83 Ga) and the Nueltin granite (1.75 Ga) suites, as identified and characterized by Peterson et al. (2015).

This area lies within the southern Hearne Province, a poorly understood terrane. The domain is dominantly comprised of Archean tonalitic and charnockitic gneisses, approximately 2.8 Ga in age. However, strong evidence for fragments of much older crust, up to 3.3 Ga, has been found in the form of inherited Archean zircons and Sm–Nd model ages obtained from Proterozoic post-orogenic plutons of the Hudson granite, intruded at about 1.83 Ga. Nueltin rapakivi granite (ca. 1.75 Ga) is also present in the area.

A comprehensive multielement study of the lake sediment data was carried out by Grunsky et al. (2012a, b), and by Grunsky and Kjarsgaard (2016). One of the results of those studies was to show that the multivariate geochemistry could be used to map the various rock types using a variety of methods including PCA.

**Fig. 23.8** Geological map of NTS sheets 65A, 65B and 65C, with coordinates shown for UTM Zone 14, Nunavut Territory, adapted from Grunsky et al. (2012a, b). Two units noted in text are Nueltin granite (Pp-Ng shown in orange) occurring in west and Hudson granite (Pp-Hgr shown in light pink) occurring in east

### **23.3.1.2 Application**

The data consist of 1611 samples and 48 geochemical elements, both majors and traces. Prior to CLR transformation, all variables were converted to ppm. PCA was carried out on all 48 elements. The objective was to understand better how uranium is partitioned between the two granites: the Nueltin and the Hudson.

PCA was computed on all 48 CLR-transformed variables. A scree plot (Fig. 23.9a) shows that the first 15 PCs (out of the full 48) account for almost 85% of the total variation in the data, and the first 5 PCs account for over 60%. Inspection of the uranium loadings (Fig. 23.9b) shows that PCs 2 and 3 both have high positive loadings, whereas PC 5 has a strong negative loading. Multiple regressions were carried out (using U-CLR, **not** untransformed U) starting with PC1, then successively adding PCs up to 12. For each regression, predicted U and residual U were calculated and mapped (not shown here), and a record made of the goodness of fit (Fig. 23.10). This graph shows that PC1 does not account for much U variation, but PCs 2 and 3 show marked increases in goodness of fit. PC 4 shows a minor increase, and PC5 shows a major increase. After PC5, improvements in goodness of fit are minor.

**Fig. 23.9 a** Scree plot showing cumulative variation explained by first 15 PCs. **b** Values of loadings for U-CLR on first 15 PCs

Figure 23.11 shows maps of U-CLR predicted from PCs 1-5, and U-CLR residuals. Not shown is the unmodified U-CLR map (which sums these two parts). Notable here is that the predicted map shows a pattern strongly correlated with the Nueltin granite, whereas the residual map is strongly correlated with the Hudson granite. PCs 1-5 'explain' the uranium in the Nueltin granite, whereas the residual uranium is that which occurs in the Hudson granite. The residual PC analysis has partitioned uranium into two parts that have a distinct geological interpretation.

This is confirmed in Fig. 23.12, which shows, for the successive regressions, the results of t-tests on the mean U-residual in the Nueltin and Hudson granites. The value of t increases up to PC5, then decreases. This confirms that, for partitioning uranium between the two granites, regression against PC1-5 gives the best result.
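A two-sample t statistic of the kind described can be sketched as below. The text does not state which form of the t-test was used, so Welch's unequal-variance form is an assumption here, and the residual values for the two granites are invented for illustration.

```python
import math

def welch_t(group_a, group_b):
    """Two-sample t statistic without assuming equal variances (Welch's form)."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = sum(group_a) / n_a, sum(group_b) / n_b
    var_a = sum((v - mean_a) ** 2 for v in group_a) / (n_a - 1)
    var_b = sum((v - mean_b) ** 2 for v in group_b) / (n_b - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)

# Hypothetical U-CLR residuals at sites underlain by each granite.
hudson_residuals = [0.6, 0.4, 0.7, 0.5]
nueltin_residuals = [-0.3, -0.1, -0.4, -0.2]

t_value = welch_t(hudson_residuals, nueltin_residuals)
print(round(t_value, 3))
```

Computing this statistic after each successive regression and plotting it against the number of PCs would give a curve of the type the text describes, with the largest absolute value marking the clearest separation of the two groups.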

**Fig. 23.11** Left. Map of U-CLR predicted from PCs 1-5 using lake sediment data. Right. Map of residual U-CLR unexplained by PCs 1-5. Predicted uranium is strongly related to presence of Nueltin granite, whereas residual uranium is strongly related to presence of Hudson granite. Map of total U-CLR does not distinguish between these two granites

### *23.3.2 Discussion*

These two methods add to the already large basket of multivariate methods useful for interpreting regional geochemical surveys.

With the wide use of GIS, spatial information is now easily determined for many features of map data. Distance calculations from points to points, points to lines and points to polygons are now routine, allowing the spatial characterization of proximity of geochemical samples to mineral deposits (points, depending on map scale), to faults or specified contacts (lines), or to rock units (polygons). In 3-D, proximity of geochemical samples to an orebody using borehole data is also straightforward. There are therefore many potential applications of proximity regression for a variety of situations involving multivariate geochemical data.

One particular idea that may be worthy of investigation is the application of this approach to prospectivity mapping. Instead of treating known mineral occurrences as binary points to be predicted from a series of evidential layers (weights of evidence, logistic regression, neural networks, etc.), a response variable could be constructed showing distance (or proximity) to the nearest mineral occurrence. The explanatory variables can be various evidential layers, as usual. The result would not be the probability of occurrence of a mineral occurrence, but rather the predicted proximity to the nearest mineral occurrence.

It should also be noted that proximity regression as described here has used ordinary multiple linear regression, so although the observed proximity measure is in the range (0, 1), predicted proximities are unconstrained and may be greater than 1 or negative. There might be some advantage to using logistic regression, which would automatically constrain the expected proximity to the range (0, 1), and would also allow the use of non-numeric explanatory variables (e.g. presence/absence of geological units, etc.). Alternatively, there are several neural network approaches that could also be tried for predicting proximity.

It should be noted that, when doing a residual PCA on geochemical data, logratio transforms are essential, because closure is well known to introduce artefacts in PCA results. Experience has also shown that the geochemical element used as the response variable must also be CLR transformed, as regression results are poor if untransformed response variables are used in the analysis.

In the separation of uranium between the Nueltin and Hudson granites, it would be most interesting to determine whether this partition was also related to isotopic differences. But this would require isotopic analyses of the lake sediment samples, an expensive proposition.

### **23.4 Conclusions**

Proximity analysis allows for the use of multielement geochemical data for direct prediction of proximity to geological features, such as mineralization, faults and intrusions.

Application of proximity analysis to lithogeochemical data from the Ben Nevis area showed that a suite of elements provided a good prediction of proximity to the Canagau Mines deposit, and that this model also predicted the Croxall property and other nearby sulphide occurrences.

Residual principal components analysis is a useful way to partition particular geochemical elements that can facilitate geological interpretation.

For example, uranium in a lake sediment survey could be partitioned into two groups based on PCs. Uranium associated with PCs 1-5 is strongly correlated with the Nueltin granite, whereas residual uranium, after removing the effects of PCs 1-5, is strongly correlated with the Hudson granite.

### **References**



### **Chapter 24 Mathematical Minerals: A History of Petrophysical Petrography**

**John H. Doveton**

**Abstract** The quantitative estimation of mineralogy from wireline petrophysical logs began as an analytical stepchild. The calculation of porosity in reservoir lithologies is affected by mineral variability, and methods were developed to eliminate these components. Simple inversion methods were applied in pioneer applications by mainframe computers to a limited suite of digital log data. Over time, the value of lithological characterization of reservoirs and resource plays has been recognized. At the same time, the introduction of newer petrophysical measurements, particularly geochemical logs, in conjunction with increasingly sophisticated algorithms, has increased confidence in mineral profiles from logs as a routine evaluation tool.

### **24.1 Pioneering Computer Methods**

The volumetric determination of mineral composition from petrophysical logs originated in efforts to obtain reliable porosity estimates that were confounded by variations in rock mineralogy. When Archie (1950) introduced the term 'petrophysics' he framed it in terms of "the physics of particular rock types" and then elaborated on the petrophysics of reservoir rocks. The petrophysical properties that he considered were restricted entirely to those "related to the pore and fluid distribution". The reason was obvious in that almost all boreholes were drilled for the location of either hydrocarbons or useable water in commercial quantities. The mineralogy of the pore framework complemented the fluid content of the pore network, but estimations would be focused on the evaluation of pore volume, permeability, and fluid content. In monominerallic rocks, pore volumes could be estimated very simply by interpolating between two endpoints of mineral and fluid. In multiminerallic rocks, porosity estimates became more difficult and significant errors were introduced if the mineral properties were radically different from one another.

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_24

J. H. Doveton (✉)

Kansas Geological Survey, Lawrence, KS 66047, USA e-mail: doveton@kgs.ku.edu

Probably the earliest application of a mathematical solution to the resolution of porosity in a multiminerallic rock was directed to Permian carbonate reservoirs in West Texas. Petrophysicists were frustrated by complex mineralogy in their attempts to obtain reliable porosity estimates from logs as described by Savre (1963). Porosities had been commonly estimated from neutron logs, but values were excessively high in zones that contained gypsum, caused by the hydrogen within the water of crystallization. If the density log was used, then porosity estimation was compromised by the occurrence of either anhydrite or gypsum. Collectively, the mix of dolomite, anhydrite, gypsum, and porosity meant that pore volumes could not be resolved by graphical methods such as crossplots and nomograms that were the standard procedures of that time.

It was recognized that lithologies composed of several minerals would require several porosity logs to be run in combination in order to estimate volumetric porosity. In the most simple solution model, the proportions of multiple components together with porosity could be estimated from a set of simultaneous equations for the measured log responses. These equations can be written in matrix algebra form as:

$$CV = L$$

where *C* is a matrix of the component petrophysical properties, *V* is a vector of the component unknown proportions, and *L* is a vector of the log responses of the evaluated zone. The equation set describes a linear model that links the log measurements with the component mineral properties. Although porosity represents the proportion of voids within the rock, the pore space is filled with fluid whose physical properties make it a "mineral" component. The set of equations is then solved as an "inverse problem", in which rock composition is deduced from the logging measurements. As a closed system of dolomite, anhydrite, gypsum, and porosity, a deterministic solution is possible from three log inputs, which were chosen as neutron, density, and acoustic velocity log measurements. The solution for the unknown vector, *V* is:

$$V = C^{-1}L$$

where *C*<sup>−1</sup> is the inverse of the *C* matrix.
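The determined inversion can be sketched for the four-component system of dolomite, anhydrite, gypsum and pore fluid. The endmember responses below are illustrative textbook-style values assumed for this example, not values taken from the source, and the fourth row of the matrix is the unity (closure) equation that makes the system square. Synthetic logs are generated from known proportions so that the recovery can be checked.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(a[i]) + [b[i]] for i in range(n)]   # augmented copy [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("component matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):              # back substitution
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x

# Assumed endmember responses. Columns: dolomite, anhydrite, gypsum, pore fluid.
C = [
    [0.02, 0.00, 0.60, 1.00],     # neutron porosity (fraction)
    [2.87, 2.98, 2.35, 1.00],     # bulk density (g/cm3)
    [43.5, 50.0, 52.0, 189.0],    # sonic transit time (us/ft)
    [1.0, 1.0, 1.0, 1.0],         # unity (closure) equation
]

# Synthetic zone: forward-model the logs from known proportions, L = C V.
V_true = [0.60, 0.15, 0.10, 0.15]
L = [sum(c * v for c, v in zip(row, V_true)) for row in C]

# Inverse problem: recover the proportions from the log responses.
V = solve_linear_system(C, L)
for name, value in zip(("dolomite", "anhydrite", "gypsum", "porosity"), V):
    print(f"{name:10s} {value:.3f}")
```

Solving the system zone by zone is equivalent to applying *C*<sup>−1</sup> once to every depth level, since the component matrix does not change with depth.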

Savre (1963) described how this procedure was coded in a computer program, as a pioneer application of computers to petrophysics. An example of the graphical output drafted from one of the earliest computer runs is shown in Fig. 24.1 (Alger et al. 1963), where profiles of porosity, dolomite, anhydrite, and gypsum are shown from a Permian San Andres Formation section in West Texas. At the time that this early application was made, computing power was typically provided by a single mainframe computer in the company or university which had extended computing times and limited memory, while programming code was a specialized and time-consuming task. The same application is very easy to implement today as a spreadsheet procedure, using standard matrix functions and graphical outputs.

**Fig. 24.1** Graphical output profiles of porosity, dolomite, anhydrite, and gypsum from one of the earliest computer runs that processed neutron, sonic, and density logs of a Permian San Andres Formation section in West Texas (from Savre 1963)

The inverse solution is a simple and powerful procedure for compositional analysis, but its simplicity carries certain assumptions that must be considered carefully. In particular, the basic model contains no intrinsic constraint to preclude negative estimates of compositional proportions. A unity equation dictates the closure of the system so that the proportions collectively sum to unity. However, individual proportions can have a negative value or one that exceeds unity. Rather than representing mathematical error, apparently anomalous zones are located outside the composition space defined by the mineral endmembers as vertices. Consequently, the generation of negative proportions is a perfectly natural consequence of the model and can contain useful feedback information. If the negative values are small, then this is usually caused by the stochastic nature of the input nuclear logs coupled with borehole rugosity perturbations. If large, the possibility of washouts and gas effects should be examined before evaluating the possibility of another mineral that is not included in the composition model.

If these explanations are not sufficient, then negative proportions of components have a role as a basic check on the validity of the model used for compositional analysis. As such, they are diagnostic errors with an information content to be used to guide the analysis to a better solution. The distinction between errors that are acceptable as minor, random measurement noise and systematic deviations is best made by a comparison between the original logs and the logs predicted by the model solution. The predictions are given by:

$$
\hat{L} = CV
$$

If the inverse procedure has generated zone solutions with proportions that are negative or exceed unity, then the adjustment to rational proportions will result in log predictions that will deviate from the original logs. The deviations between measurements and predictions can then be examined to differentiate minor measurement error from systematic perturbations that require intervention and correction. In the more sophisticated models to be reviewed, tool response errors are actively incorporated within the solution algorithm, together with constraints that preclude irrational compositional proportions.

However, if the solution results in compositional proportions that are all positive, then there will be an exact match between the logs and model predictions. This equivalence does not imply that the result is geologically correct; it simply means that the solution is rational and consistent with the choice of components and their properties. There may be other satisfactory solutions based on alternative mineral suites.
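The behavior described above can be sketched for a determined system. This is a minimal illustration, not the chapter's own software: the endmember values, the function names `invert_zone`, `rationalize`, and `predict_logs`, and the clip-and-rescale adjustment are all assumptions made for the example.

```python
import numpy as np

# Hypothetical endmember responses (illustrative values, not taken from the chapter).
# Rows: bulk density (g/cm3), neutron porosity (fraction), unity equation.
# Columns: quartz, calcite, pore water.
C = np.array([
    [2.65, 2.71, 1.00],
    [-0.02, 0.00, 1.00],
    [1.00, 1.00, 1.00],
])

def invert_zone(C, logs):
    """Solve C V = L for a determined system; the last entry of L is the unity value 1."""
    L = np.append(np.asarray(logs, dtype=float), 1.0)
    return np.linalg.solve(C, L)

def rationalize(V):
    """Assumed adjustment: clip negative proportions to zero, then rescale to sum to one."""
    clipped = np.clip(V, 0.0, None)
    return clipped / clipped.sum()

def predict_logs(C, V):
    """Forward model: L_hat = C V."""
    return C @ V

# Zone 1: logs generated from a rational composition, so logs and predictions match.
V_true = np.array([0.6, 0.2, 0.2])
logs1 = (C @ V_true)[:2]
V1 = invert_zone(C, logs1)

# Zone 2: a density reading above every endmember forces proportions outside [0, 1].
logs2 = [2.75, 0.01]
V2 = invert_zone(C, logs2)
V2_adj = rationalize(V2)
residual2 = np.append(logs2, 1.0) - predict_logs(C, V2_adj)  # nonzero after adjustment
```

In the second zone the raw solution still sums to unity, but only because a negative quartz proportion offsets a calcite proportion above one; the residual after adjustment is the kind of deviation that can be inspected zone by zone.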

### **24.2 Mineralogy of Underdetermined Systems**

The basic compositional inversion procedure requires a precise match between the number of knowns and unknowns. This situation is a "determined system". The alternative possibilities are that the number of logs is insufficient to provide a unique resolution of the proportions of the components (an underdetermined system) or that the number of logs exceeds the number of components (an overdetermined system). In reality, it is likely that most formations present underdetermined compositional problems, if all the constituents are counted and matched against the number of logs run in a typical borehole. As counterpoint, many of the minerals will be found in small quantities and the overall composition dominated by a few components.

McCammon (1970) and Harris and McCammon (1971) considered alternative model procedures to the estimation of mineral compositions from logs in underdetermined cases. Although their algorithms have been superseded by optimization procedures, their approach is instructive concerning the role of information in log compositional analysis and the potentially competing criteria of mathematical optimality and geological reality. McCammon (1970) considered the underdetermined system in terms of classical information theory, which proposes that the least biased solution is the one that maximizes the entropy function:

$$E = -\sum p_i \log p_i$$

where *p<sub>i</sub>* is the proportion of the *i*th component. This equation for entropy is closely approximated by that for proportional variance, written here with the proportions denoted *ν<sub>i</sub>*:

$$\begin{aligned} P &= \left(\frac{n}{n-1}\right) \sum \nu_i (1 - \nu_i) \\ &= \left(\frac{n}{n-1}\right) \left(1 - \sum \nu_i^2\right) \end{aligned}$$

The maximum of the variance function, *P*, is close to the condition of maximum entropy, and the resulting optimal solution is easier to compute using the matrix algebra equation:

$$V = C^\prime (CC^\prime)^{-1} L$$

where *V* is the vector of unknown proportions, *C* is the matrix of component log properties, the prime (′) signifies a matrix transpose, and *L* is the vector of zone log responses (Doveton and Cable 1979).
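The step from the variance criterion to the matrix equation can be sketched as follows. Maximizing *P* is equivalent to minimizing the sum of squared proportions, *V′V*, subject to the log response equations *CV = L* (assuming here that the unity equation is carried as one row of *C*). With a vector of Lagrange multipliers *λ*, introduced only for this derivation:

$$
\begin{aligned}
\mathcal{L}(V,\lambda) &= V^\prime V - 2\lambda^\prime \left(CV - L\right) \\
\frac{\partial \mathcal{L}}{\partial V} = 2V - 2C^\prime \lambda = 0 \;&\Rightarrow\; V = C^\prime \lambda \\
CV = L \;&\Rightarrow\; CC^\prime \lambda = L \;\Rightarrow\; \lambda = \left(CC^\prime\right)^{-1} L \\
V &= C^\prime \left(CC^\prime\right)^{-1} L
\end{aligned}
$$

The result is the minimum-norm solution of the underdetermined system, which requires the rows of *C* to be linearly independent so that *CC′* can be inverted.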

The compositional solution from the proportional variance algorithm is optimal from a classical statistical viewpoint: the average squared errors between estimates and real compositions should be the minimum possible.

This is a conservative philosophy that aims to be least wrong or risk-averse with a minimum error as penalty. However, mineral proportions are frequently distributed in a highly unequal manner. Therefore the real rock composition will often be one of several extreme possibilities, rather than the less likely seemingly homogeneous composition that can result from a minimum variance solution. The correct interpretation of a bland compositional solution is that it represents the average of a range of possibilities. As such, it is a good estimate of the average, but may be a very poor prediction of the particular: the composition of the zone in question. Such a result is a useful diagnostic that suggests that several extreme alternatives should be reviewed and that extra information is required. The information can take a variety of forms, such as explicit geological knowledge of the range of actual compositions, or the use of additional constraints that preclude impossible solutions.
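A small numerical sketch may help to show what the proportional variance solution does. The response values below are hypothetical and the variable names are chosen for the example; the only claim is that the matrix equation returns the exact solution with the smallest sum of squared proportions.

```python
import numpy as np

# Hypothetical responses: 2 logs plus the unity equation, 4 components (underdetermined).
# Rows: bulk density (g/cm3), neutron porosity (fraction), unity.
# Columns: quartz, calcite, dolomite, pore water. Values are illustrative only.
C = np.array([
    [2.65, 2.71, 2.87, 1.00],
    [-0.02, 0.00, 0.02, 1.00],
    [1.00, 1.00, 1.00, 1.00],
])
L = np.array([2.50, 0.12, 1.00])

# V = C' (C C')^{-1} L, computed with a linear solve rather than an explicit inverse.
V = C.T @ np.linalg.solve(C @ C.T, L)

# The same minimum-norm answer is returned by the Moore-Penrose pseudoinverse.
V_pinv = np.linalg.pinv(C) @ L

# Any other exact solution differs by a null-space vector and has a larger sum of squares.
null_vec = np.linalg.svd(C)[2][-1]   # last right-singular vector spans the null space
V_other = V + 0.1 * null_vec
```

Both `V` and `V_other` honor the logs exactly, which is the sense in which the bland minimum-norm answer stands for a whole family of admissible compositions. Nothing in the calculation prevents individual proportions from being negative.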

### **24.3 Mineralogy of Overdetermined Systems**

Many rocks are dominated by a relatively small number of components, so that the number of logging tool measurements may exceed the number of significant lithological components. The situation becomes overdetermined when the number of log response equations is greater than the number of components. The appropriate solution is then one that most accurately reproduces the original logs when logs are calculated as predictions from the compositional solutions. Using conventional statistical theory, this solution is the one that minimizes the sum of squares of the deviations between the original logs and their predictions. The least-squares solution is given readily by the matrix algebra equation:

$$V = \left(C^\prime C\right)^{-1} C^\prime L$$

where the terms are the same as those in both the determined and underdetermined matrix algorithms written earlier. The matrix formulation requires some additional weighting function to allow for the fact that the logging measurements are recorded in radically different units. Without any weighting, the error minimization is predicated on equal units and results in a solution which preferentially honors logs with the highest data ranges. The modified least-squares algorithm is then:

$$V = \left(C^\prime W C\right)^{-1} C^\prime W L$$

where *W* is a diagonal matrix that contains the elements of a weight vector (Harvey et al. 1990). The weights may be assigned based on physical first principles or by a standardization scheme, such as transformation from the original measurement to a scale anchored to the mean and counted in standard deviation units.
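The weighted form can be sketched in a few lines. The endmember values, the assumed standard deviations, and the choice of inverse variances as weights (including a heavily weighted unity equation) are illustrative assumptions, not values from the chapter.

```python
import numpy as np

def weighted_least_squares(C, L, weights):
    """Return V = (C' W C)^{-1} C' W L with W = diag(weights)."""
    W = np.diag(weights)
    return np.linalg.solve(C.T @ W @ C, C.T @ W @ L)

# Hypothetical overdetermined model: 4 equations, 3 components (values illustrative).
# Rows: bulk density (g/cm3), neutron porosity (fraction), sonic (us/ft), unity.
# Columns: quartz, calcite, pore water.
C = np.array([
    [2.65, 2.71, 1.00],
    [-0.02, 0.00, 1.00],
    [55.5, 47.5, 189.0],
    [1.00, 1.00, 1.00],
])
V_true = np.array([0.55, 0.25, 0.20])
noise = np.array([0.01, -0.005, 1.5, 0.0])   # assumed measurement perturbations
L = C @ V_true + noise

# Assumed standard deviation of each equation; weights are inverse variances.
sigma = np.array([0.02, 0.02, 2.0, 0.001])
V_weighted = weighted_least_squares(C, L, 1.0 / sigma**2)

# Equal weights let the sonic log dominate, because it has the largest numerical range.
V_unweighted = weighted_least_squares(C, L, np.ones(4))
```

Comparing `V_weighted` with `V_unweighted` on the same noisy zone shows how strongly the choice of weights steers the solution when the logs are recorded in very different units.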

For any given zone, the sum of squares error is given by:

$$e = \left(L - \hat{L}\right)^\prime \left(L - \hat{L}\right)$$

where *L̂* is the vector of log responses associated with the least-squares solution. The error term can be plotted as a monitor log to highlight zones where there are striking inconsistencies between the model and the log responses. The overall performance of an algorithm may be judged from the standard error, computed from the summed zone errors as:

$$s_e = \sqrt{\frac{\Sigma e}{(n-m-1)}}$$

where *n* is the number of observations and *m* is the number of logs.
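As a small sketch of the two error measures, the functions below compute the zone error and the standard error exactly as written in the equations above. The function names and the standardized log values are hypothetical.

```python
import numpy as np

def zone_error(L, L_hat):
    """Sum of squared deviations e = (L - L_hat)'(L - L_hat) for one zone."""
    d = np.asarray(L, dtype=float) - np.asarray(L_hat, dtype=float)
    return float(d @ d)

def standard_error(zone_errors, m):
    """s_e = sqrt(sum(e) / (n - m - 1)), with n observations and m logs as in the text."""
    n = len(zone_errors)
    if n - m - 1 <= 0:
        raise ValueError("need n > m + 1 observations")
    return float(np.sqrt(np.sum(zone_errors) / (n - m - 1)))

# Hypothetical standardized logs (6 zones by 3 logs) and their model predictions.
L = np.array([[0.2, -0.1, 0.4], [0.0, 0.3, -0.2], [1.0, 0.5, 0.1],
              [-0.4, 0.2, 0.0], [0.3, 0.3, 0.3], [0.1, -0.6, 0.2]])
L_hat = L + np.array([[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.2],
                      [0.0, 0.0, 0.0], [0.1, 0.1, 0.0], [0.0, 0.0, 0.1]])

errors = [zone_error(a, b) for a, b in zip(L, L_hat)]   # one value per zone (monitor log)
s_e = standard_error(errors, m=3)
```

The list `errors` is what would be plotted against depth as the monitor log, while `s_e` summarizes the whole interval in a single number.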

### **24.4 Optimization Methods**

Current compositional analysis procedures have moved beyond the simple inversion algorithms described above, so that constraints and tool error functions are incorporated as part of the solution process. The methodology was first developed by Mayer and Sibbit (1980), who applied modified steepest-descent strategies to hunt for an optimal solution that minimized the "incoherence" between the logs and their predicted values. For any given log, the incoherence function is given by:

$$I_A = \frac{\left(a - \hat{a}\right)^2}{\left(\sigma_A^2 + \tau_A^2\right)}$$

where *I<sub>A</sub>* is the incoherence for log *A*, *a* is the log response for the zone and *â* is its prediction, and *σ<sub>A</sub>*<sup>2</sup> and *τ<sub>A</sub>*<sup>2</sup> are the uncertainties associated with the log measurement and the response equation, respectively.

The uncertainty term for each log measurement is compounded from the sources of sensor error, data acquisition, and the dispersions associated with environmental corrections. Response equation dispersion represents the uncertainties introduced by linear approximations, erroneous choices of component log responses, and hidden factors such as the influence of textural parameters. It seems reasonable to suppose that these two types of uncertainty are independent, so that they can be summed as one total error term for each tool:

$$
\mu_A^2 = \sigma_A^2 + \tau_A^2
$$

The total log incoherence for any particular depth zone is the sum of the separate log incoherences:

$$I_t = I_A + I_B + I_C + \dotsb$$

The form of the equations shows that the solution will tend to be most strongly influenced by the logs to which the most confidence can be attributed. Logs with large uncertainties have a larger denominator, so a given misfit produces a smaller incoherence and contributes less to the total incoherence term.
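The incoherence terms are simple to evaluate. The sketch below uses hypothetical log readings and uncertainties, and the function name `log_incoherence` is introduced only for the example.

```python
import numpy as np

def log_incoherence(a, a_hat, sigma, tau):
    """I_A = (a - a_hat)^2 / (sigma^2 + tau^2), elementwise for arrays of logs."""
    a, a_hat = np.asarray(a, float), np.asarray(a_hat, float)
    return (a - a_hat) ** 2 / (np.asarray(sigma, float) ** 2 + np.asarray(tau, float) ** 2)

# Hypothetical zone: density (g/cm3), neutron porosity (fraction), sonic (us/ft).
measured = np.array([2.45, 0.18, 72.0])
predicted = np.array([2.47, 0.16, 70.0])
sigma = np.array([0.015, 0.015, 1.0])   # assumed measurement uncertainties
tau = np.array([0.010, 0.020, 2.0])     # assumed response-equation uncertainties

I_logs = log_incoherence(measured, predicted, sigma, tau)
I_total = I_logs.sum()                  # total incoherence for the zone

# For the same misfit, doubling both uncertainties cuts the incoherence to a quarter.
I_sonic_loose = log_incoherence(72.0, 70.0, 2.0, 4.0)
```

Dividing by the combined variance puts logs with different units on a common, dimensionless scale, which is what allows the separate terms to be added.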

Constraints are also included and take the general form of:

$$g_i(\nu_i) \ge 0$$

where *g<sub>i</sub>* is some function that constrains the value of the unknown proportion of the *i*th component. Rigid, mathematical constraints are those that preclude the occurrence of proportions that are negative or those that exceed unity. Geological and local constraints incorporate relations that conform to general geological principles or prior knowledge of local geology. These geological constraints are more generalized, so that appropriate uncertainties are assigned to them. The constraint dispersions generate additional incoherence terms to be considered. A combined incoherence function is then the sum of the log and constraint incoherences:

$$I_t = \sum_i \frac{\left(a_i - \hat{a}_i\right)^2}{\sigma_i^2 + \tau_i^2} + \sum_j \frac{g_j\left(\nu_j\right)^2}{\tau_j^2}$$

Notice that if the system is fully determined, then the total incoherence will be zero, provided that no constraints are violated. This special situation is the limiting case of applications which are otherwise presumed to be overdetermined. In a routine application of the optimization algorithm, the number of logs would be expected to exceed the number of components. In part, this is feasible because the bulk of rock compositions tend to be dominated by relatively few components. In addition, the range of wireline measurements used today typically extends beyond the traditional porosity logs to resistivity, spectral gamma ray and geochemical logs.

The optimization method of Mayer and Sibbit (1980) is an iterative search procedure. The system model of input logs and output components is first defined. The incoherence values associated with each log type are entered, together with the constraints to be met. For each zone, an initial composition is estimated by an approximate method and used as the starting point for a sequence of intermediate solutions. At each step, the incoherence is calculated between the input log responses and those predicted from the solution. A gradient is also computed as the means to generate the next solution, using a steepest descent technique. The process terminates when convergence is reached, that is, when there is no appreciable difference between successive solutions. The final solution will be approximate, but the total incoherence between the logs and the compositional estimate will be the minimum possible. The combined display of real and theoretical logs is invaluable as a quality control mechanism to alert the user to problem zones which may be optimal, but are flatly wrong. The generality of the approach allows alternative and remedial attempts to be made without major difficulty.
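The general shape of such a search can be sketched with a plain fixed-step steepest descent. This is not the algorithm of Mayer and Sibbit (1980), whose modifications are not described here; it is a minimal stand-in in which the quadratic penalty terms, the step-size rule, the starting composition, and all numerical values are assumptions.

```python
import numpy as np

def minimize_incoherence(C, L, mu, penalty=100.0, tol=1e-10, max_iter=20000):
    """Fixed-step steepest descent on log incoherence plus quadratic penalty terms.

    The penalties discourage negative proportions and departures from a unit sum.
    """
    A = C / mu[:, None]              # scale each log equation by its total uncertainty
    b = L / mu
    n = C.shape[1]
    # Step size from an upper bound on the curvature of the objective.
    step = 1.0 / (2.0 * (np.linalg.norm(A, 2) ** 2 + penalty * (n + 1)))

    def objective(v):
        neg = np.minimum(v, 0.0)
        return (np.sum((A @ v - b) ** 2)
                + penalty * np.sum(neg ** 2)
                + penalty * (v.sum() - 1.0) ** 2)

    def gradient(v):
        return (2.0 * A.T @ (A @ v - b)
                + 2.0 * penalty * np.minimum(v, 0.0)
                + 2.0 * penalty * (v.sum() - 1.0))

    v = np.full(n, 1.0 / n)          # crude starting composition: equal proportions
    history = [objective(v)]
    for _ in range(max_iter):
        v_new = v - step * gradient(v)
        history.append(objective(v_new))
        converged = np.linalg.norm(v_new - v) < tol
        v = v_new
        if converged:                # no appreciable change between successive solutions
            break
    return v, history

# Hypothetical normalized responses: 4 logs (rows) by 3 components (columns).
C = np.array([[0.9, 0.2, 0.1],
              [0.1, 0.8, 0.3],
              [0.2, 0.1, 0.9],
              [0.5, 0.5, 0.4]])
mu = np.array([0.05, 0.05, 0.05, 0.10])   # assumed total uncertainty per log
V_true = np.array([0.5, 0.3, 0.2])
V_est, history = minimize_incoherence(C, C @ V_true, mu)
```

Because the synthetic logs are noise-free and the true composition is rational, the search should settle on `V_true` with an objective value near zero; with real logs the final value of `history` is the residual incoherence that would be reviewed alongside the reconstructed logs.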

In further refinements, Gysen et al. (1987) described an extension of the method to the simultaneous optimization of component proportions and response parameters. Moss and Harrison (1985) also reported a technique to solve for the uncertainty multipliers which contain the total error associated with each tool. Although the errors cannot be solved for every depth zone, they can at least be estimated for selected intervals and assumed to be effectively constant between zones.

Phyllosilicate minerals pose a difficult problem because their composition is so variable. However, the clay mineral properties listed provide a useful reference standard in the estimation of hypothetical composition volumes in the absence of explicit information keyed to the formation that is analyzed. The estimates can be considered as normative, as contrasted with modal predictions of clay mineral proportions based on X-ray diffraction analyses from core.

Optimal, minimum error solutions are worthless if the component model is incorrectly specified. Meaningful results are best obtained by patient geological evaluation of a sequence of solutions, where the results of each are used to improve the next. Modern compositional analysis software utilizes the power of the error minimization method, but allows user interaction so that alternative geological models can be compared.

Quirein et al. (1986) described the use of quadratic programming techniques and linearized response equations, as an improvement on the penalty constraint approach used by earlier methods. In addition, they incorporated a program to solve for poorly known log responses of a component subset, as an optimization procedure applied to specific depths that could be used for calibration. These calibration intervals are those where both logs and compositions are known and are most typically those that have been cored. In addition, knowledge of composition could be utilized from other sources. Not all component log responses need to be estimated since their properties are restricted to a limited range. However, a subset of mineral components has ambiguous and locally variable properties. The most notorious examples of such components are clay minerals, and these will be discussed more fully in the following section.

In common with earlier optimization methodologies, the system is assumed to be either determined or overdetermined. The use of multiple alternative models then allows a more realistic treatment of this assumption, in which common associations can be modeled in parallel and a final selection made between them at any depth. Wherever possible, each separate model is designed to be close to fully determined in an attempt to find a good match and to sidestep problems associated with the estimates of log and equation dispersions (Marett and Kimminau 1990). The appropriate logs for each model are clearly those that discriminate well between the separate components. If a poor choice of logs is made, then the model is ill-conditioned. The model structure can be checked through the computation of the condition number of:

$$C^\prime D C$$

where *C* is the matrix of component log responses and *D* is a matrix of uncertainty values. The condition number is higher for ill-conditioned models and gives a measure of the sensitivity of proportion estimates to small changes in component log responses (Quirein et al. 1986). The choice between alternative models for any zone can be made by the user based on an assessment of the relative incoherence of the solutions and their feasibility as reasonable geological descriptions. Alternatively, the decision can be made on the basis of probability established either from comparison of alternative solutions or the use of a Bayesian prior probability.
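A conditioning check of this kind can be sketched with NumPy. The example assumes that *D* is diagonal and holds the inverse variances of the response equations, which is one plausible reading of "a matrix of uncertainty values"; the endmember responses and uncertainties are hypothetical.

```python
import numpy as np

def model_condition_number(C, sigma):
    """2-norm condition number of C' D C, with D = diag(1 / sigma^2) (an assumption)."""
    D = np.diag(1.0 / np.asarray(sigma, dtype=float) ** 2)
    return float(np.linalg.cond(C.T @ D @ C))

# Columns in both models: quartz, calcite, pore water. Values are illustrative only.
C_poor = np.array([[2.65, 2.71, 1.00],    # bulk density (g/cm3)
                   [-0.02, 0.00, 1.00],   # neutron porosity (fraction)
                   [1.00, 1.00, 1.00]])   # unity equation
C_good = np.array([[2.65, 2.71, 1.00],    # bulk density (g/cm3)
                   [4.78, 13.8, 0.40],    # volumetric photoelectric absorption (barns/cm3)
                   [1.00, 1.00, 1.00]])   # unity equation

cond_poor = model_condition_number(C_poor, [0.02, 0.02, 0.01])
cond_good = model_condition_number(C_good, [0.02, 0.30, 0.01])
```

Density and neutron porosity barely separate quartz from calcite, so the first model should return the larger condition number; replacing the neutron log with a photoelectric measurement separates the two minerals and lowers it.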

While generally still applied to an overdetermined system, the multiple models are not far removed from determined matches of components and logs. Where a model becomes determined, the solution is that of a simple and fast matrix inversion with zero incoherence, provided that the non-negative constraint is not violated. The analysis of the relative conditioning of the model system is a valuable mathematical contribution to the determination of which logs provide the maximum discrimination of model components that will lead to the most stable estimates of volumetric proportions.

### **24.5 Clay Component Estimation**

Shales are composed typically of a mixture of clay minerals, quartz, carbonates, and iron minerals, as well as other accessory components. Clay minerals are markedly different from other rock-forming minerals in terms both of their complexity and variability. Shales present special problems for log interpretation and while many algorithms have been designed for their volumetric estimation, the meaning and limitations of their results should be understood.

In more detailed work, the older and broader methods of shale evaluation have been expanded to the quantitative assessment of clay mineral species. Clay minerals show differing degrees of variability, but are generally subdivided between four major types: illite, smectite, kaolinite, and chlorite. Clay mineral typing is based on several log criteria which must be considered carefully and collectively. Ellis (1987, pp. 460–461) noted that the four principal clay mineral types could be combined into two types, based on their hydroxyl content. Kaolinite and chlorite have eight hydroxyls, as contrasted with four for smectite and illite. The neutron log is sensitive to this difference, which can be used as one diagnostic guide, through comparison of the neutron and density porosities when they are both scaled with respect to a quartz matrix. The photoelectric factor is also a useful clay discriminator because of its control by the aggregate atomic number. Ellis (1987, pp. 451–454) pointed out that iron-free aluminosilicate clays would have photoelectric absorption characteristics that are virtually the same as for quartz. Therefore, variations in the photoelectric factor within shales are primarily a reflection of iron content. Overall, there is a tendency for a progressive increase in iron from low values in kaolinite, through smectite and illite, to high values for iron-bearing chlorite. Distinctions between clay minerals can also be made on the basis of spectral gamma-ray logs, particularly in the differentiation of relatively potassium-rich illites from low-potassium kaolinite and chlorite.

The quantitative estimation of clay mineral abundances from the neutron, density, photoelectric factor, and spectral gamma ray measurements is fraught with difficulties. Wide compositional changes within clay mineral groups pose special problems. Useful quantitative models are not easy to define and are frequently ambiguous in their interpretation. The most realistic approach would be to coordinate log measurements with laboratory analyses of core samples. The core values may be idealized as a calibration standard in the development of a statistical prediction model for clay minerals from logs. Even this strategy must be considered thoughtfully and honestly. The most widely used laboratory method to estimate quantities of clay minerals is that of X-ray diffraction. Even with careful sample preparation procedures, the error of clay mineral estimates from X-ray diffraction can be routinely expected to be 50% or more of the reported value (Eslinger and Pevear 1988, p. A-24). Nevertheless, an important result is that at least the appropriate mineral subset can be identified with some confidence. This ensures that the correct components will be selected for compositional analysis from logs. Reconciliation of the log estimates with X-ray diffraction analyses should then be made within a model that attributes appropriate error magnitudes to both data sources.

### **24.6 Normative Estimation by Geochemical Logs**

Geochemical logging tools measure induced gamma-ray spectra that are created when a formation is bombarded by high energy neutrons from an electronic pulsed source. A matrix inversion spectral fit algorithm then separates the spectrum into individual elemental sources. The major rock composition elements of silicon, calcium, magnesium, iron, sulfur, titanium and carbon are estimated together with the rare earth, gadolinium. In addition, potassium, thorium, and uranium can be estimated from the natural gamma rays emitted by formations and measured by the spectral gamma-ray log. As a consequence of the direct relationship between elemental data and mineral compositions, more realistic mineral transforms have been developed that are a major improvement on models based on mineral properties. However, a distinction must be made between normative minerals that are computed from transforms of elemental data and modal minerals that are observed visually or by petrographic laboratory methods such as X-ray diffraction or infrared spectroscopy. Clearly, the fundamental goal of an effective transform is to provide a close match between normative mineral solutions and modal mineral suites.

"Normative" minerals calculated from oxide analyses have been a standard procedure in igneous petrology since the CIPW (Cross-Iddings-Pirsson-Washington) norm was introduced by Cross et al. (1902). These normative minerals are contrasted with modal compositions that are commonly measured by point-counting of minerals in thin-sections of rock. The normative concept has also been extended to sedimentary rocks in attempts to compute realistic mineral assemblages. Krumbein and Pettijohn (1938) pp. 490–492 explained the molecular ratio method to calculate the probable mineral composition of a rock, based on chemical analyses of oxide percentages. As a first step, the minerals to be resolved are first identified from thin-section observation or other sources of information. The molecular ratios are then assigned in a stepwise fashion to the minerals. The process consists of a logical order of steps that first accommodates unique associations between oxides and certain minerals, and then allocates the remainder to other components. Imbrie and Poldervaart (1959) described a commonly used method of sedimentary normative analysis and then compared the results with modal estimates of mineralogy. From a detailed study of the Permian Florena Shale, they concluded that estimates of the chert, calcite, dolomite, and clay had errors of less than 5%. However, there was little agreement between computed clay mineral proportions and those produced from X-ray diffraction analysis. Imbrie and Poldervaart (1959) were not surprised by this discrepancy, but attributed it to the known high variability of clay mineral compositions through isomorphous substitution.

Essentially the same problems are tackled in the computation of sedimentary normative minerals, when based on elements measured by geochemical logs (Herron 1986). However, many of the older normative methods predated computers. The classical norm calculation is subtractive, deterministic and rigidly leveraged. As discussed by Harvey et al. (1990), the method can be useful when certain elements can be assigned totally to single individual minerals. These assignations can then be made in an ordered protocol of analysis partition between mineral species. Otherwise, the use of simultaneous equations to link mineral compositions with elemental measures is a much more general and powerful method. The speed of modern software also allows real-time interaction between petrophysicist and machine, so that alternative models can be evaluated quickly and decisions made that blend mathematical optimality with geological credibility. Any analysis should be preceded by some notion of what constitutes a fit-for-purpose estimation. Less accuracy is needed if the intent is for a generalized semi-quantitative description of variation rather than more rigorous estimates for use in quantitative basin modeling or physical property predictions (Harvey et al. 1998).
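The subtractive, ordered character of a classical norm can be illustrated with a deliberately small sketch. The allocation rules below are a hypothetical three-mineral example and are not the procedure of Imbrie and Poldervaart (1959) or Harvey et al. (1990); only the rounded molecular weights are standard values.

```python
# Molecular weights (g/mol), rounded to two decimals.
MW = {"CaO": 56.08, "MgO": 40.30, "calcite": 100.09, "dolomite": 184.40}

def simple_carbonate_norm(oxides_wt_pct):
    """Allocate oxide weight percentages to dolomite, calcite, and quartz or chert.

    Assumed order of allocation (hypothetical, for illustration only):
      1. all MgO goes to dolomite, CaMg(CO3)2, taking an equal molar amount of CaO;
      2. the remaining CaO goes to calcite, CaCO3;
      3. all SiO2 is reported as quartz or chert.
    """
    mol_mgo = oxides_wt_pct.get("MgO", 0.0) / MW["MgO"]
    mol_cao = oxides_wt_pct.get("CaO", 0.0) / MW["CaO"]
    mol_dolomite = min(mol_mgo, mol_cao)      # dolomite is limited by the scarcer oxide
    mol_calcite = mol_cao - mol_dolomite      # CaO left over after the dolomite step
    return {
        "dolomite": mol_dolomite * MW["dolomite"],
        "calcite": mol_calcite * MW["calcite"],
        "quartz": oxides_wt_pct.get("SiO2", 0.0),
    }

norm = simple_carbonate_norm({"CaO": 28.04, "MgO": 4.03, "SiO2": 30.0})
```

Each step consumes part of the analysis before the next mineral is considered, so an error in an early assignment propagates through every later one. That rigidity is what a simultaneous-equation formulation avoids.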

The model that links minerals with elements can be set up as a fully determined system and solved by standard matrix inversion using methods described earlier. Whenever the components are computed as positive proportions, then the compositional solution is rational and honors the analysis perfectly. However, in common with the normative model, any apparent precision read into the result is illusory because the determined system makes no allowance for analytical error. It is usually practical to model a rock with a set of minerals that are fewer in number than the elements available from geochemical logging. The system is then overdetermined and can be resolved by one or other of a variety of optimization techniques. The additional complexity in computation is offset by several distinct advantages. The overdetermination allows constraints and error functions to be incorporated, both for optimal solution control and diagnostic evaluation of sources of analytical error. The choice of an overdetermined system also provides better assurance of a stable solution in situations where the mineral response matrix becomes sparse or there are potential compositional colinearities that link some of the mineral subsets (Harvey et al. 1990).

Strictly speaking, there will almost always be more minerals than elements to solve for them, so that the problem is always underdetermined. However, as Herron (1988) noted, the overwhelming majority of sedimentary rocks are composed of only ten minerals: quartz, four clays, three feldspars, and two carbonates. In practice, reasonable compositional solutions can be generated using relatively small mineral sub-sets, provided that they have been identified correctly and that the compositions used are both fairly accurate and constant. Alternatively, the inversion procedure can be run as an unconstrained procedure and components with negative proportions eliminated from the model. Harvey et al. (1998) found this approach to be successful, but cautioned that negative components should be eliminated one at a time, starting with the largest negative component, because of interactions between the components.
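The one-at-a-time elimination can be sketched as a short loop around an unconstrained least-squares solve. The response matrix, the mineral labels, and the function name are hypothetical, and the elemental vector was constructed so that the first pass returns one negative component.

```python
import numpy as np

def eliminate_negative_components(C, L, names):
    """Unconstrained least squares, dropping the most negative component on each pass."""
    active = list(range(C.shape[1]))
    while active:
        v, *_ = np.linalg.lstsq(C[:, active], L, rcond=None)
        worst = int(np.argmin(v))
        if v[worst] >= 0.0:
            return {names[j]: float(p) for j, p in zip(active, v)}
        del active[worst]            # remove one component only, then re-solve
    return {}

# Hypothetical element-by-mineral response matrix (4 elements, 3 minerals).
C = np.array([[0.9, 0.2, 0.1],
              [0.1, 0.8, 0.3],
              [0.2, 0.1, 0.9],
              [0.5, 0.5, 0.4]])
names = ["mineral_A", "mineral_B", "mineral_C"]
L = np.array([0.63, 0.43, 0.08, 0.51])   # built so that pass one makes mineral_C negative

solution = eliminate_negative_components(C, L, names)
```

Re-solving after each removal matters because the remaining proportions shift once a component is dropped: here the two surviving minerals do not keep their first-pass values.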

Mineral solutions may be calculated by two alternative strategies. In the first, the average chemical compositions of minerals drawn from a large database are used as endmember responses and resolved by standard matrix inversion procedures. This result is normative and generic in the sense that it is based on a sample drawn from a universal mineral reference set and applied to a specific sequence where local mineral compositions may deviate from the global average. The result is hypothetical, but has the particular advantage that comparisons can be made between a variety of locations and do not require expensive ancillary core measurements. New methods of classification may also be necessary, as discussed by Herron (1988) in his study of terrigenous sands and shales in terms both of core and geochemical log data.

In a second approach, the solution is calibrated to core data, where laboratory determinations of mineralogy and elemental geochemistry are analyzed by multiple regression techniques to determine local mineral compositions. This result is linked to petrography and so is philosophically closer to an estimated modal solution, rather than the more hypothetical normative model. As mentioned earlier, realistic statistical calibration models should incorporate error terms from all sources of measurement. When geochemical logging was first introduced, several detailed studies were made to assess the strengths and limitations of borehole geochemistry through exhaustive comparisons with core elemental and mineralogical analyses. These included comparisons in the Conoco Research well, Ponca City, Oklahoma by Hertzog et al. (1987); the discussion of the results from an Exxon research well which penetrated Upper Cretaceous siliciclastic rocks in Utah by Wendlandt and Bhuyan (1990); and an assessment of data from three Shell wells in the Netherlands, Oman, and the U.S. by van den Oord (1990).

There are several ways to assess modal mineralogy, so which constitutes the most accurate method to use as a standard for the real mineral composition? Harvey et al. (1998) addressed this problem when they compared core data from the spectral measurements of quantitative X-ray diffraction and infrared spectroscopy, as well as micrometric analysis from thin section point counts. Overlapping peaks and poor resolution at low concentrations pose special problems for the spectral methods, while appropriate sample sizes must be observed for robust statistics in micrometric analysis. Also, the distinction between volume percentage and weight percentage must be observed when interrelating modal and normative compositions. Harvey et al. (1998) concluded that the results of their study did not favor one method over another, but pointed out that their comprehensive analysis demonstrated the difficulty of obtaining accurate modal estimates and even the notion of what constitutes the "real" mineral composition. This is certainly worth bearing in mind when making a judgement about the "accuracy" of a normative mineral solution from inversion of log responses. So, for example, mismatches in clay mineral estimates by log inversion represent a failure to reproduce the results of quantitative X-ray diffraction, which are themselves only estimates of the true composition.

A major obstacle to the production of unique mineral transformations from element concentrations has been the problem of compositional colinearity. If precisely colinear, then an infinite range of solutions is possible, causing a matrix singularity and a breakdown of an inversion procedure. If average mineral compositions are used, a solution becomes possible, but may be unstable (Harvey et al. 1998). Wendlandt and Bhuyan (1990) found that the use of silicon, potassium and aluminum tended to result in overestimates of kaolinite; the use of iron to predict illite content caused underestimates of kaolinite. However, effective discrimination between illite and kaolinite contents became possible when dry density was applied as an extra constraint.

### **24.7 Conclusion**

The estimation of mineral composition from petrophysical logs is now a standard feature in any log analysis software package. However, the degree to which these estimates match reality is highly variable and requires a knowledgeable and experienced user to work with powerful procedures. The identification of the major mineral suite that actually occurs in the rock is an important first step. As the old Chinese proverb says, "The beginning of wisdom is calling a thing by its right name." In the end, the solution of "mathematical minerals" will often come down to a choice between an acceptable estimate of an unreachable modal mineralogy or the realization of a useful, but hypothetical, normative assemblage.

### **References**



### **Chapter 25 Geostatistics for Seismic Characterization of Oil Reservoirs**

**Amílcar Soares and Leonardo Azevedo**

**Abstract** In the oil industry, exploratory targets tend to be increasingly complex and located deeper and deeper offshore. The usual absence of well data and the increase in the quality of the geophysical data, verified in the last decades, make these data unavoidable for the practice of oil reservoir modeling and characterization. In fact, the integration of geophysical data in the characterization of the subsurface petrophysical variables has been a priority target for geoscientists. Geostatistics has been a key discipline to provide a theoretical framework and corresponding practical tools to incorporate as much as possible different types of data for reservoir modeling and characterization, in particular the integration of well-log and seismic reflection data. Geostatistical seismic inversion techniques have been shown to be quite important and efficient tools to integrate simultaneously seismic reflection and well-log data for predicting and characterizing the subsurface lithofacies, and their petro-elastic properties, in hydrocarbon reservoirs. The first part of this chapter presents the state of the art and the most recent advances of geostatistical seismic inversion methods, to evaluate the reservoir properties through the acoustic, elastic and AVA seismic inversion methods with real case application examples. In the second part we present a methodology based on seismic inversion to assess uncertainty and risk at early stages of exploration, characterized by the absence of well data for the entire region of interest. The concept of analog data is used to generate scenarios about the morphology of the geological units, distribution of acoustic properties and their spatial continuity. A real case study illustrates this approach.

A. Soares (✉) <sup>⋅</sup> L. Azevedo

CERENA, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1049-001 Lisbon, Portugal e-mail: asoares@tecnico.ulisboa.pt

L. Azevedo e-mail: leonardo.azevedo@tecnico.ulisboa.pt

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_25

### **25.1 Integration of Geophysical Data for Reservoir Modeling and Characterization**

One of the main challenges in hydrocarbon reservoir characterization has been the integration of different types of data (geological conceptual models, well-log data, geophysical data, production data) for modeling the subsurface properties of interest while assessing the corresponding uncertainty and risk. Although well data provide 'hard' measurements of the subsurface properties, such data are usually scarce and therefore have limited spatial representativeness, so models built from them alone normally provide little understanding of the complex and heterogeneous subsurface geology of the entire reservoir area. Since the 1980s, geostatistics has been a key discipline in providing a theoretical framework, and the corresponding practical tools, to incorporate as many different types of data as possible into reservoir modeling and characterization, in particular seismic reflection data (Dubrule 2003). One of the most important contributions of geostatistical methods to seismic data integration in reservoir modeling has been the development of stochastic seismic inversion techniques.

Because seismic reflection data cover the full spatial extent of the reservoir volume, they have high spatial representativeness and offer a privileged window onto the subsurface petro-elastic properties of interest. However, seismic reflection data are an indirect measurement of these properties and have poor resolution along the vertical direction (temporal domain). This translates into a much larger support than that of well-log data and into much greater uncertainty, derived both from measurement errors and from the nonlinear relationship between the recorded seismic signal and the subsurface properties one wishes to describe (Tarantola 2005). This has been the most serious limitation of the direct use of seismic data as secondary information, whether in methods that use it as a local trend, in joint simulation methods (Dubrule 2003), or even in methods that account for the different support of both data types (Liu and Journel 2009).

To overcome such limitations, an alternative approach has been widely used. Seismic inversion methods are based on the following rationale: subsurface petrophysical properties (such as facies, porosity and saturation) can be related to seismic attributes such as acoustic and/or elastic impedances; hence, one wishes to know the model parameters **r** (reflectivity coefficients derived from the subsurface elastic properties) which, convolved with a known wavelet **w**, give rise to the known solution **A** (i.e. the recorded seismic amplitudes):

$$\mathbf{A} = \mathbf{r} * \mathbf{w}.\tag{25.1}$$

The theoretical solutions for seismic inversion are stated in Tarantola (2005). The seismic inversion problem was first tackled with deterministic methodologies (Lindseth 1979; Lancaster and Whitcombe 2000; Russell 1988; Coléou et al. 2005). Later, this framework was extended into a statistical domain. Among the many statistical inverse approaches, two different stochastic approaches for solving the seismic inversion problem are worth mentioning. The first group of stochastic methodologies approaches seismic inversion as an optimization problem solved by an iterative and convergent process. This includes what are traditionally designated as iterative geostatistical seismic inversion methods, from the seminal work by Bortoli et al. (1993) to the most recent geostatistical inversion methods (Soares et al. 2007; Nunes et al. 2012; Azevedo et al. 2015; Azevedo and Soares 2017). The second group of stochastic seismic inversion algorithms is known as linearized Bayesian inverse methodologies. These are based on a particular solution of the inverse problem using the Bayesian framework and assuming the model parameters and observations, as well as the data error, to be multi-Gaussian distributed, which allows the forward model to be linearized (Buland and Omre 2003). Several authors have recently contributed towards overcoming some of the limitations of this method, particularly the multi-Gaussian assumption, by using Gaussian Mixture Models (Grana and Della Rossa 2010).

This chapter summarizes some iterative geostatistical modeling techniques dealing with the integration of seismic reflection and well-log data, through seismic inversion procedures, for characterizing hydrocarbon reservoirs with high spatial resolution models of main properties of interest, such as lithologies, facies and fluid saturations.

Uncertainty and risk assessment at different stages of exploration are also important targets of the methodologies addressed in this chapter. Hence, the chapter closes with recent advances in geostatistical seismic inversion methods for uncertainty and risk assessment at early stages of exploration.

### **25.2 Iterative Geostatistical Seismic Inversion Methodologies**

The aim of seismic inversion is the inference of the subsurface elastic or acoustic properties from recorded seismic reflection data. The retrieved inverse models can be acoustic and/or elastic impedance for post-stack seismic data, or density, P-wave and S-wave models if the inversion algorithm is used to invert pre-stack seismic reflection data (Francis 2006).

Seismic inversion might be described as an ill-posed and nonlinear problem with multiple solutions that can be summarized by (Tarantola 2005):

$$\mathbf{d}\_{\rm obs} = \mathbf{F}(\mathbf{m}) + \mathbf{e}.\tag{25.2}$$

The goal is to estimate a subsurface Earth model, **m**, that after being forward modelled by **F** produces synthetic seismic data showing a good correlation with the recorded (observed) seismic data, **d**<sub>obs</sub>, which are normally contaminated by measurement errors **e**. The match between observed and synthetic seismic is achieved by maximizing (or minimizing) an objective function that measures the similarity (or mismatch) between synthetic and real seismic. For example, the objective function can be as simple as Pearson's correlation coefficient:

$$\rho\_{X,Y} = \frac{cov(X,Y)}{\sigma\_X \sigma\_Y},\tag{25.3}$$

where *cov* is the centered covariance between variables *X* and *Y*, which are the synthetic and real seismic volumes, respectively, and σ denotes the standard deviation of each variable. More complex objective functions combine Pearson's correlation coefficient with least-squares errors calculated between the amplitudes of the synthetic and the recorded seismic reflection data.
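As a minimal sketch of Eq. 25.3 used as an objective function, the Python snippet below computes the Pearson correlation between a synthetic and a recorded trace. The function name, the zero-variance convention and the toy amplitude values are illustrative assumptions, not part of the published method.

```python
import math

def pearson_correlation(x, y):
    """Pearson correlation coefficient (Eq. 25.3) between two equal-length traces."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2 samples)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Centered covariance and variances (the 1/n factors cancel in the ratio).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # assumed convention for a flat trace
    return cov / math.sqrt(var_x * var_y)

# Hypothetical amplitudes: the synthetic trace has the same shape as the
# recorded one but twice the amplitude.
real_trace = [0.0, 0.4, -0.9, 0.3, 0.1, -0.2]
synthetic_trace = [2.0 * v for v in real_trace]
rho = pearson_correlation(synthetic_trace, real_trace)
```

Because the coefficient is invariant to amplitude scaling, the doubled synthetic trace still scores a correlation of (numerically) one, which is one reason to pair the correlation with an amplitude-sensitive term such as a least-squares error.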

A geostatistical seismic inversion framework consists of an iterative procedure in which a set of realizations of the model parameters, **m**, is generated by stochastic sequential simulation (Deutsch and Journel 1996) and optimized until the objective function reaches a given user-defined value or a fixed number of iterations is completed. Geostatistical inversion techniques use stochastic sequential simulation as the model perturbation technique. This ensures that all the models generated during the iterative procedure reproduce the main spatial continuity patterns and the joint distribution functions of the acoustic and/or elastic properties of interest, as retrieved from the existing well-log data, while simultaneously allowing the uncertainty attached to the retrieved inverse models to be assessed.

Within this framework there are two traditional approaches for integrating seismic reflection and well-log data for hydrocarbon reservoir modeling.

### **25.3 Trace-by-Trace Geostatistical Seismic Inversion**

Geostatistical seismic inversion was introduced by the seminal papers of Bortoli et al. (1993) and Haas and Dubrule (1994). These authors proposed a sequential trace-by-trace approach in which each seismic trace, or location within the inversion grid, is visited individually following a pre-defined random path within the seismic volume. At each step along the random path, a set of *Ns* realizations of one acoustic impedance trace is simulated using sequential Gaussian simulation (Gómez-Hernández and Journel 1993; Deutsch and Journel 1996), taking the well-log data and the previously visited/simulated nodes into account. Then, for each simulated impedance trace, the corresponding reflection coefficients are derived and convolved with a wavelet, resulting in a set of *Ns* synthetic seismic traces. Each of the *Ns* synthetic traces is compared with the recorded/real seismic trace by means of a mismatch function. The acoustic impedance realization that produces the best match between the real and the synthetic seismic traces is retained in the reservoir grid as conditioning data for the simulation of the next acoustic impedance trace, at the next location along the pre-defined random path.

One of the main drawbacks of trace-by-trace stochastic seismic inversion methodologies concerns areas of the recorded seismic reflection data with a low signal-to-noise ratio. In areas of poor seismic signal, sequential trace-by-trace approaches impose inverted models that fit the observed noisy seismic reflection data. As the simulated trace is treated as 'real' data in subsequent steps, this can spread unreliable impedance values that are related to noisy seismic samples. Noisy areas should instead be interpreted as high-uncertainty areas with very little influence throughout the inversion process. More recent versions of trace-by-trace methods try to overcome this drawback by avoiding noisy areas in the early stages of the inversion procedure (Grijalba-Cuenca and Torres-Verdín 2000).

### **25.4 Global Geostatistical Seismic Inversion Methodologies**

To overcome these limitations, Soares et al. (2007) introduced the global stochastic inversion methodology which, contrary to trace-by-trace approaches, uses a global approach during the stochastic sequential simulation stage of the inversion procedure: at each iteration a set of *Ns* impedance models is generated at once for the entire inversion grid. The general outline of this family of geostatistical inversion algorithms is depicted in Fig. 25.1. Briefly, this group of iterative inverse approaches uses the principle of cross-over genetic algorithms as the global optimization technique that drives the convergence of the procedure from iteration to iteration, while the model perturbation is performed using direct sequential simulation and co-simulation (Soares 2001). The global optimizer uses the trace-by-trace correlation coefficients between each simulated synthetic seismic volume and the real seismic data as the affinity criterion to create the next generation of models, by stochastic sequential co-simulation, for the next iteration. The iterative procedure continues until a stopping criterion is reached, frequently a target value of the global correlation coefficient between real and inverted seismic reflection data.
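The trace-by-trace affinity criterion can be sketched as follows. This is only an illustration under stated assumptions: the synthetic traces are taken as already computed for each of the *Ns* realizations, the stochastic simulation and co-simulation steps are omitted entirely, and the names `select_best_traces`, `models`, `synthetics` and `real` are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 for a flat trace (assumed convention)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def select_best_traces(models, synthetics, real):
    """For every trace location keep the realization whose synthetic trace
    correlates best with the real trace.

    models[s][t]     : property trace of realization s at trace location t
    synthetics[s][t] : synthetic seismic trace computed from models[s][t]
    real[t]          : recorded seismic trace at location t
    Returns the composite best model and the local correlation per trace.
    """
    n_realizations = len(models)
    best_model, best_corr = [], []
    for t in range(len(real)):
        corrs = [pearson(synthetics[s][t], real[t]) for s in range(n_realizations)]
        s_best = max(range(n_realizations), key=corrs.__getitem__)
        best_model.append(models[s_best][t])
        best_corr.append(corrs[s_best])
    return best_model, best_corr

# Toy ensemble: 2 realizations, 2 trace locations, 4 samples per trace.
real = [[0, 1, 0, -1], [1, 0, -1, 0]]
synthetics = [[[0, 2, 0, -2], [0, 1, 0, 1]],
              [[1, 0, 1, 0], [2, 0, -2, 0]]]
models = [[[10, 11, 12, 13], [20, 21, 22, 23]],
          [[30, 31, 32, 33], [40, 41, 42, 43]]]
best_model, best_corr = select_best_traces(models, synthetics, real)
```

In this toy case the composite model takes its first trace from realization 0 and its second from realization 1. In the published workflow the composite model and its local correlations then condition the co-simulation of the next generation of models; that step is not reproduced here.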

In global iterative geostatistical seismic inversion procedures, areas of low signal-to-noise ratio remain poorly matched throughout the entire iterative inversion procedure: an ensemble of best-fit inverted models will always present high variability, or high uncertainty, for those noisy areas where the signal-to-noise ratio is low.

**Fig. 25.1** General outline for global iterative geostatistical seismic inversion

This framework was generalized for the inversion of seismic reflection data for acoustic and elastic impedance, direct inversion of petrophysical properties and seismic AVA inversion. These methods are introduced with more detail in the following sections.

## *25.4.1 Global Geostatistical Acoustic Inversion*

The global stochastic inversion (GSI; Soares et al. 2007; Caetano 2009) is one of the existing methods to invert fullstack seismic reflection data for acoustic impedance (Ip) models. The general outline of this iterative geostatistical methodology can be described in the following sequence of steps, summarized in Fig. 25.2:

**Fig. 25.2** Outline of geostatistical acoustic inversion (adapted from Azevedo and Soares 2017)



$$RC = \frac{Ip\_2 - Ip\_1}{Ip\_2 + Ip\_1},\tag{25.4}$$

where the indexes 1 and 2 correspond to the layer above and below a given reflection interface.
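A minimal forward-modelling sketch combines Eq. 25.4 with the convolutional model of Eq. 25.1. The Ricker wavelet is an assumption made only to keep the example self-contained (in practice the wavelet is estimated from the data), and the function names and impedance values are hypothetical.

```python
import math

def reflection_coefficients(ip):
    """Normal-incidence reflection coefficients (Eq. 25.4) for an impedance trace."""
    return [(ip[i + 1] - ip[i]) / (ip[i + 1] + ip[i]) for i in range(len(ip) - 1)]

def ricker_wavelet(peak_freq_hz, dt_s, half_length):
    """Zero-phase Ricker wavelet (assumed stand-in for an estimated wavelet)."""
    wavelet = []
    for k in range(-half_length, half_length + 1):
        a = (math.pi * peak_freq_hz * k * dt_s) ** 2
        wavelet.append((1.0 - 2.0 * a) * math.exp(-a))
    return wavelet

def convolve_same(r, w):
    """Convolve r with w (Eq. 25.1), keeping len(r) samples centred on the wavelet peak."""
    half = len(w) // 2
    out = []
    for n in range(len(r)):
        acc = 0.0
        for k, wk in enumerate(w):
            j = n + half - k  # index of the reflectivity sample paired with w[k]
            if 0 <= j < len(r):
                acc += r[j] * wk
        out.append(acc)
    return out

# Toy two-layer impedance trace (hypothetical values).
ip_trace = [6000.0] * 10 + [7500.0] * 10
rc = reflection_coefficients(ip_trace)
wavelet = ricker_wavelet(peak_freq_hz=30.0, dt_s=0.004, half_length=8)
synthetic = convolve_same(rc, wavelet)
```

The single impedance contrast produces one non-zero reflection coefficient, and the synthetic trace is simply the wavelet scaled by that coefficient and centred on the interface.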


Synthetic and real case applications of geostatistical acoustic inversion can be found in several studies, for example Soares et al. (2007) and Caetano (2009). The method is illustrated here with a summary of a real application example that uses a fullstack seismic volume acquired offshore Brazil (a detailed description of the dataset is available in Azevedo et al. 2015). The best-fit Ip model (Fig. 25.3) was retrieved after six iterations, with an ensemble of 32 realizations of Ip generated at each iteration. Stochastic seismic inversion allows high-resolution (high-variability) acoustic impedance models to be retrieved. The synthetic fullstack seismic data computed from this model (Fig. 25.4) match the observed seismic reflection data in both the spatial extent of the main seismic reflections and their amplitude content. This is of great importance for this case study, since the reservoir areas are related to the spatially constrained amplitude anomalies observed in the real seismic volume. The global correlation between the inverted and the real seismic volumes is 87%.

**Fig. 25.3** Vertical well section extracted from the best-fit P-impedance volume retrieved from the global stochastic inversion after six iterations, with thirty-two realizations generated at each iteration

**Fig. 25.4** Comparison between vertical well sections extracted from: **a** synthetic seismic reflection data computed from the best-fit inverse Ip model shown in Fig. 25.3 and **b** real seismic volume. The log curve plotted on top of the seismic data represents Ip (same color scale as shown in Fig. 25.3)

## *25.4.2 Global Geostatistical Elastic Inversion*

The acoustic inversion algorithm was extended to invert partial angle stacks directly and simultaneously for acoustic (Ip) and elastic (Is) impedance models (Nunes et al. 2012; Azevedo et al. 2013b). The main purpose of this development was to integrate more information, related to the elastic domain (Is), to enrich the final elastic reservoir models and allow better lithofacies prediction. Two main differences with respect to acoustic inversion summarize this elastic inversion method (Azevedo and Soares 2017):


$$\begin{aligned} R\_{pp}(\theta) & \approx (1 + \tan^2 \theta) \frac{\Delta I\_p}{2I\_p} - 4 \left(\frac{I\_s}{I\_p}\right)^2 \sin^2 \theta \frac{\Delta I\_s}{2I\_s}, \\ \Delta I\_p &= I\_{p2} - I\_{p1}, \\ I\_p &= \frac{I\_{p2} + I\_{p1}}{2}, \\ \Delta I\_s &= I\_{s2} - I\_{s1}, \\ I\_s &= \frac{I\_{s2} + I\_{s1}}{2}. \end{aligned} \tag{25.5}$$

The index 1 refers to the vertical location in which the calculation of the reflection coefficient is carried out, the layer above the reflection interface; and 2 refers to the sample immediately below, the layer below the reflection interface.

Detailed application examples of this method can be found in Nunes et al. (2012), Azevedo et al. (2013b) and Azevedo and Soares (2017). For illustrative purposes, we show here the application of this methodology to the same case study as in the previous section. The best-fit Ip and Is models that jointly produce the highest correlation coefficient between synthetic and real seismic reflection data are shown in Fig. 25.5. Comparing the Ip models derived from the acoustic and elastic inversions, it is clear that the additional information provided by the different angles of incidence brings more detail to the retrieved inverse model. The comparison between real and synthetic seismic reflection data derived from the best-fit elastic models is shown in Fig. 25.6.

Due to the use of direct sequential simulation with joint probability distributions (Horta and Soares 2010) the relationship between Ip and Is as observed in the well-logs is reproduced for all pairs of models generated during the inversion procedure (Fig. 25.7). Besides the richness of the inverted models, this is a key step of the proposed inversion technique since it allows, for example, more reliable facies classification, and consequently a better reservoir description, over the inverted elastic models.

**Fig. 25.5** Comparison between vertical well sections extracted from: **a** best-fit Ip model and **b** best-fit Is model

**Fig. 25.6** Comparison between vertical well sections extracted from: (left) synthetic seismic reflection data computed from the best-fit inverse Ip and Is models and (right) real seismic volume. From top to bottom: nearstack, near-mid stack, far-mid stack and farstack. The log curve plotted on top of the seismic data represents Is (same color scale as shown in Fig. 25.5)

**Fig. 25.7** Comparison between the joint distribution of Ip and Is as retrieved from the best-fit inverse pair of Ip and Is and from the well-logs

### *25.4.3 Geostatistical Seismic AVA Inversion (Pre-stack Inversion)*

During the last decades, the quality of seismic reflection data has increased tremendously, while its acquisition costs have decreased. Pre-stack seismic data with high signal-to-noise ratio and high fold are nowadays a reality, which has increased the use of these data in seismic reservoir characterization even at early exploration stages. Better subsurface characterization using pre-stack seismic data is achieved by interpreting the changes of amplitude versus offset (AVO) or versus the angle of incidence (AVA; Castagna and Backus 1993; Avseth et al. 2005). The use of pre-stack seismic reflection data allows the inference of density, P-wave and S-wave velocity models, instead of the traditional impedance models. Having the three properties available individually is a clear improvement for reservoir modeling and characterization.

Stochastic seismic inversion methodologies for pre-stack seismic data, commonly called seismic AVA inversion, are being proposed based on different assumptions and frameworks (Mallick 1995; Ma 2002; Buland and Omre 2003; Contreras et al. 2005). Here we refer to geostatistical seismic AVA inversion (Azevedo et al. 2013a), which relies on the same general framework of global iterative geostatistical seismic inversion methodologies but with the following main characteristics of pre-stack inversion (see outline of Fig. 25.8; Azevedo and Soares 2017):

**Fig. 25.8** Schematic representation of the global iterative geostatistical seismic AVO inversion methodology (adapted from Azevedo and Soares 2017)


In this approach, each elastic property is generated sequentially. Density is simulated first because it is the property with the highest degree of uncertainty: its contribution to the recorded seismic reflection data is small and mostly related to the signal received at the far angles (Avseth et al. 2005). Density is also the most spatially homogeneous variable and consequently the most convenient to use as the secondary variable for the co-simulation of Vp with joint probability distributions. The resulting Vp models are then used as the auxiliary variable for the co-simulation of Vs with joint probability distributions. At the end of the iterative inversion procedure, the reproduction of the joint distributions of density, Vp and Vs allows the litho-fluid facies previously identified from the original well-log data to be distinguished within the inverted set of elastic models. Besides the spatial interpretation of these litho-fluid facies, the stochastic approach allows the assessment of the spatial uncertainty related to each facies of interest.

After the sequential simulation of *Ns* elastic models (density, Vp and Vs), an ensemble of synthetic pre-stack seismic volumes is calculated. The angle-dependent RC, *R<sub>pp</sub>*(*θ*), may be calculated, for example, following Shuey's (1985) three-term approximation:

$$R\_{pp}(\theta) \approx R(0) + G\sin^2\theta + F\left(\tan^2\theta - \sin^2\theta\right),\tag{25.6}$$

with the normal-incidence reflection, R(0), defined by:

$$R(0) = \frac{1}{2} \left( \frac{\Delta V\_p}{V\_p} + \frac{\Delta \rho}{\rho} \right),$$

the variation of the reflectivity with the angle, the AVO gradient, G:

$$G = R(0) - \frac{\Delta \rho}{\rho} \left(\frac{1}{2} + \frac{2 V\_s^2}{V\_p^2}\right) - \frac{4 V\_s^2}{V\_p^2} \frac{\Delta V\_s}{V\_s},$$

and *F*, the reflectivity at the far angles (reflection angles higher than 30°), defined as:

$$F = \frac{1}{2} \frac{\Delta V\_p}{V\_p}.$$

Each elastic property is defined on each side of the interface where the reflection occurs, as follows:

$$\begin{aligned} \Delta V\_p &= V\_{p2} - V\_{p1}, \\ V\_p &= \frac{V\_{p2} + V\_{p1}}{2}, \\ \Delta V\_s &= V\_{s2} - V\_{s1}, \\ V\_s &= \frac{V\_{s2} + V\_{s1}}{2}, \\ \Delta \rho &= \rho\_{2} - \rho\_{1}, \\ \rho &= \frac{\rho\_{2} + \rho\_{1}}{2}. \end{aligned}$$

Indexes 1 and 2 have the same meaning as in Eq. 25.4.
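The three-term approximation of Eq. 25.6 can be sketched in a few lines of Python. The gradient is written here in the commonly quoted Shuey form, G = R(0) − (Δρ/ρ)(1/2 + 2Vs²/Vp²) − (4Vs²/Vp²)(ΔVs/Vs); the function names and the layer properties are hypothetical values chosen only for illustration.

```python
import math

def shuey_terms(vp1, vs1, rho1, vp2, vs2, rho2):
    """Intercept R(0), gradient G and far-angle term F for a single interface.

    Index 1 is the layer above the interface and index 2 the layer below.
    """
    d_vp, vp = vp2 - vp1, 0.5 * (vp1 + vp2)
    d_vs, vs = vs2 - vs1, 0.5 * (vs1 + vs2)
    d_rho, rho = rho2 - rho1, 0.5 * (rho1 + rho2)
    r0 = 0.5 * (d_vp / vp + d_rho / rho)
    k = (vs / vp) ** 2
    g = r0 - (d_rho / rho) * (0.5 + 2.0 * k) - 4.0 * k * (d_vs / vs)
    f = 0.5 * d_vp / vp
    return r0, g, f

def shuey_reflectivity(vp1, vs1, rho1, vp2, vs2, rho2, theta_deg):
    """Angle-dependent reflection coefficient following Eq. 25.6."""
    r0, g, f = shuey_terms(vp1, vs1, rho1, vp2, vs2, rho2)
    theta = math.radians(theta_deg)
    sin2 = math.sin(theta) ** 2
    tan2 = math.tan(theta) ** 2
    return r0 + g * sin2 + f * (tan2 - sin2)

# Hypothetical interface: (Vp in m/s, Vs in m/s, density in g/cm3).
layer_above = (3000.0, 1500.0, 2.40)
layer_below = (2800.0, 1700.0, 2.10)
angles_deg = [0.0, 10.0, 20.0, 30.0]
rpp = [shuey_reflectivity(*layer_above, *layer_below, theta_deg=a) for a in angles_deg]
```

At zero incidence angle the result reduces to R(0), and two identical layers give zero reflectivity at every angle, which are two quick sanity checks for any implementation.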

Each angle gather is composed of *n* seismic traces, equal to the number of reflection angles considered. The *Ns* angle-dependent reflection coefficient traces are convolved with estimated angle-dependent wavelets, one for each incidence angle θ (Fig. 25.9), to obtain *Ns* synthetic angle gathers. The best elastic models, created at the end of each iteration, are composed of the portions of the elastic traces from the ensemble of density, P-wave and S-wave velocity models simulated at the current iteration that jointly produce synthetic seismic reflection data with the highest correlation coefficient when compared with the real seismic volume. Hence, the best models are selected by using a multivariate objective function (one term per angle trace); Azevedo and Soares (2017) illustrate an example of such a function.

**Fig. 25.9** Example of an angle-dependent wavelet, for 23 angles, used for the convolution of the angle-dependent reflection coefficients (*R<sub>pp</sub>*(*θ*)) to generate pre-stack seismic reflection data

As an application example, Fig. 25.10 shows vertical well sections extracted from the triplet of elastic models that produced synthetic pre-stack seismic reflection data with the maximum correlation coefficient during the iterative procedure. The inverted density, Vp and Vs models show high variability and agree with the expected spatial extent of the anomalies of interest as inferred from previous studies (Azevedo et al. 2015).

Comparing the inverse models shown in the previous sections for the different geostatistical seismic inversion techniques (Figs. 25.3, 25.5 and 25.10), it is clear that introducing more information into the inversion procedure, i.e. moving from the fullstack to the pre-stack domain, allows more detailed and variable inverse models to be retrieved. Such models usually allow a better understanding of the reservoir and help to identify and assess the main uncertainties related to its subsurface properties.

### *25.4.4 Recent Developments of Iterative Geostatistical Seismic Inversion*

The global iterative geostatistical inversion techniques presented in the previous sections have been extended to infer the subsurface petrophysical properties of interest directly from the existing seismic reflection data: direct geostatistical seismic inversion to porosity (Azevedo and Soares 2017), and integration of rock physics into geostatistical seismic AVA inversion for simultaneous characterization of facies (Azevedo et al. 2015). In addition, these methodologies have great potential for integrating very different types of data, such as controlled-source electromagnetic (CSEM) data. An application example of the joint inversion of seismic and electromagnetic data is given in Azevedo and Soares (2014).

**Fig. 25.10** Vertical well section extracted from the best-fit models of: (from top to bottom) density, Vp and Vs

The integration of dynamic production data with seismic data is another important and very promising field of application of these methodologies. In fact, the integration of dynamic production data in reservoir modeling (commonly designated as history matching) is an even more complex inverse problem (e.g. Oliver and Chen 2011; Oliver et al. 2008; Mata-Lima 2008; Demyanov et al. 2011; Caeiro et al. 2015). If history matching is also approached within an iterative geostatistical framework, the integration of both inverse methods can lead to a very rich model that is able to characterize complex geological structures and, simultaneously, reproduce the geological conceptual model, the seismic data and the dynamic data at the production wells (Marques et al. 2015; Azevedo and Soares 2017).

### **25.5 Uncertainty and Risk Assessment at Early Stages of Exploration**

This section introduces a recent development that uses seismic inversion for uncertainty and risk assessment at early stages of reservoir exploration, which are characterized by the lack of well data. The idea of the proposed methodology is to use geological analog data to define possible geological models of a given target, such as the geometry of the different geological units, as well as the a priori probability distributions of the elastic property of interest. An a priori uncertainty space is first built from plausible geological scenarios, generated from different sources of knowledge about the area of interest. For each scenario the corresponding elastic properties are computed and the existing seismic reflection data are integrated through geostatistical seismic inversion, giving rise to an uncertainty space of petro-elastic properties. The case study presented below corresponds to the first steps in this direction.

### *25.5.1 Characterization of Different Scenarios with Analogue Data*

Due to the lack of data, several authors use analog data to constrain reservoir models and to integrate regional geological knowledge into them (e.g. Martinius et al. 2014; Grammer et al. 2004). Analog fields and/or sedimentary basins can help to understand and predict the behavior of a reservoir, since they are natural systems that may be similar to the unknown study area. For example, one of the most valuable pieces of information that analogs can provide for reservoir modeling, normally obtained from outcrop studies (Howell et al. 2014), concerns the geometry of the different geological units, the relationships between them and their elastic properties.

This section proposes the extension of a traditional geostatistical seismic inversion methodology to integrate data from analogs (Pereira et al. 2017). In this application example the analog information is provided by well-logs located very far from the exploration area but somehow geologically related with the area of study. This iterative geostatistical seismic inversion methodology integrates a priori knowledge from the regional geology and the information from analogs, such as existing well-logs far from the region of interest (illustrated in Fig. 25.11).

One of the mandatory steps of this procedure consists in dividing the area of interest into regional geological units, based on conventional seismic interpretation and the current knowledge of the prospect under study. The available seismic reflection data should be interpreted in such a way that the interpreted seismic units are consistent with the stratigraphy of the region. The geological regionalization model of the area of study should be based not only on the available seismic reflection data but also on information from outcrop analogs or on the geological knowledge of the sedimentary basin.

**Fig. 25.11** Schematic representation of the workflow to integrate geological analogue data into geostatistical seismic inversion, for each scenario

After the definition of the geological regionalization model, one needs to establish different scenarios about the elastic response of each geological unit. These can be inferred, for example, from analog data. This critical step should integrate expertise from different fields. The correlation between the elastic and rock properties should result in probability distribution functions of the elastic property of interest for each region. The resulting distributions should be representative of the elastic properties of each geological region and also of the relationship between the different geological regions: if there is a progressive transition between geological regions (i.e. a geological transition in terms of facies), this relationship should be expressed in the distributions of each region.

This approach is illustrated here with a real case study located in an unexplored offshore basin. The available data for this basin comprise a 3D seismic reflection volume and three appraisal wells drilled outside the main region of interest. The existing appraisal wells show evidence that suggests hydrocarbon generation, migration and possibly accumulation. Within this unexplored basin a promising prospect was identified, associated with a turbidite system that corresponds to a classic clastic sedimentary unit. The prospect can be recognized and interpreted from the available seismic reflection data (Fig. 25.12). A detailed description of the geology of this basin can be found in Pereira et al. (2017).

**Fig. 25.12** Real seismic data for the area of interest showing the seismic signature of the prospect of interest. Lighter values indicate positive polarity and darker values indicate negative polarity

The interpretation of the existing seismic reflection data resulted in three main geological units. For each region, probability distribution functions of Ip were assumed, taking into account the geological knowledge of the region of interest and the Ip logs available at the three neighboring wells. A wavelet representative of the time interval of interest was extracted exclusively from the available seismic reflection data using conventional statistical wavelet extraction techniques (i.e. Wiener-Levinson filters). One of the most difficult steps of this methodology is the validation of the wavelet scale. A possible way to tackle this issue is to select a plausible distribution function of Ip for each region by comparing the amplitude values of the synthetic seismic with the observed ones.
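One simple way to make this amplitude comparison concrete is an RMS amplitude ratio between observed and synthetic traces. The source does not prescribe a specific measure, so the ratio below, the function names and the toy amplitudes are all assumptions made for illustration.

```python
import math

def rms(values):
    """Root-mean-square amplitude of a trace."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def amplitude_scale(synthetic, observed):
    """Ratio of observed to synthetic RMS amplitude (assumed diagnostic)."""
    rms_synthetic = rms(synthetic)
    if rms_synthetic == 0.0:
        raise ValueError("synthetic trace has zero energy")
    return rms(observed) / rms_synthetic

# Hypothetical traces: the observed amplitudes are 3000 times the synthetic ones.
observed = [0.0, 120.0, -300.0, 90.0, 30.0]
synthetic = [0.0, 0.04, -0.10, 0.03, 0.01]
scale = amplitude_scale(synthetic, observed)
```

A ratio far from one, after the wavelet has been scaled, would suggest that either the wavelet amplitude or the assumed Ip distributions (and hence the reflectivity contrasts) are not plausible for the region.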

## *25.5.2 Geostatistical Seismic Inversion of Each Scenario*

The previous step of this approach results in a set of geological models that represent the uncertainty about the prospect to be modelled. In order to reduce this uncertainty space, this step is based on the following rationale:

(i) for each of the a priori chosen scenarios, in terms of geological regionalization model, one intends to obtain the models of acoustic and/or petrophysical properties that match the known seismic by running a conventional iterative geostatistical seismic inversion;

(ii) the match between the synthetic seismic of each scenario and the real seismic can be used to validate or falsify the scenarios and to build an uncertainty space of those properties.

Here, we show an example for one of the scenarios considered. The iterative geostatistical seismic inversion ran for six iterations; at each iteration a set of thirty-two realizations of Ip was generated, conditioned simultaneously to the regionalization model (i.e. the three main seismic units resulting from seismic interpretation; Fig. 25.12) and to the individual Ip distributions inferred from the nearby analog wells and published data.

The seismic inversion converged after six iterations, when the global correlation coefficient between real and synthetic seismic reflection data reached 85%. For region 1, the overburden region, the correlation coefficient was 80%; for region 2, the potential reservoir region, it was 89%; and for region 3, the underburden region, it was 70%. The synthetic seismic data reproduced the real observed seismic reflection data in terms of the location and spatial distribution of the main geological features of interest.
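The convergence criterion above is a correlation coefficient between synthetic and observed seismic. A minimal sketch of how such a score might be computed for a single trace is given below, assuming a 1D convolutional forward model with normal-incidence reflection coefficients. The Ricker wavelet, the function names and the toy impedance values are illustrative assumptions and are not taken from the case study, where the wavelet is extracted from the seismic data.

```python
import numpy as np

def reflectivity(ip):
    """Normal-incidence reflection coefficients from an acoustic impedance trace."""
    ip = np.asarray(ip, dtype=float)
    return (ip[1:] - ip[:-1]) / (ip[1:] + ip[:-1])

def ricker(freq_hz, dt_s, half_len):
    """Ricker wavelet (an assumption; any extracted wavelet could be used instead)."""
    t = np.arange(-half_len, half_len + 1) * dt_s
    a = (np.pi * freq_hz * t) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

def synthetic_trace(ip, wavelet):
    """Convolve the reflectivity series with the wavelet, keeping its length."""
    return np.convolve(reflectivity(ip), wavelet, mode="same")

def trace_correlation(a, b):
    """Pearson correlation coefficient between two traces."""
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)
wavelet = ricker(freq_hz=30.0, dt_s=0.004, half_len=20)

# Toy "true" impedance: three layers standing in for the three regions.
ip_true = np.concatenate([np.full(60, 5000.0), np.full(40, 4200.0), np.full(60, 6500.0)])
observed = synthetic_trace(ip_true, wavelet)

# One candidate realization: the toy model perturbed by random noise.
ip_candidate = ip_true + rng.normal(0.0, 50.0, ip_true.size)
score = trace_correlation(synthetic_trace(ip_candidate, wavelet), observed)
print(f"correlation with observed trace = {score:.2f}")
```

In an iterative scheme, a score of this kind would be evaluated for every realization (globally or per region) and used to decide which realizations inform the next iteration.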

The best-fit inverse Ip model (Fig. 25.13) allows the interpretation of the turbidite feature of interest in both vertical and horizontal slices. It also shows a reasonable spatial continuity pattern in which it is possible to identify both large and subtle features of potential interest when appraising an unexplored sedimentary basin. Moreover, it is clear that each region of the inversion grid is constrained individually by a given distribution function of Ip values. In this way we are constraining the spatial distribution of the simulated values. Since the regionalization of the area of interest is done using a geological criterion, the resulting best-fit inverse models are geologically consistent with the existing geological knowledge.

Uncertainty and risk in this unexplored area could be assessed by repeating the same exercise for different scenarios regarding the geometry of the geological units (regions) as well as the Ip distributions for each of them.

**Fig. 25.13** Best-fit inverse model of Ip retrieved after 6 iterations (left). It is possible to identify the turbidite system of interest corresponding to lower acoustic impedance values (purple). At right is the distribution function of the Best-fit inverse model of Ip, which reproduces the initial distribution function of Ip

### **25.6 Final Remarks**

This chapter presents the state of the art and the most recent advances in geostatistical seismic inversion. The promising results of the presented and referenced case studies show the maturity of these methods as privileged instruments for integrating different types of data, particularly seismic reflection data, in the characterization and modeling of hydrocarbon reservoirs.

Very recent studies on the integration of electromagnetic and production data show that inversion methodologies open important new paths for geostatistical tools in modelling complex geological structures.

The methodology introduced for the characterization of uncertainty and risk in early stages of exploration integrates two important components: (i) the use of analog data to generate scenarios of uncertainty regarding the morphology of geological units and the distribution of acoustic and petrophysical properties; (ii) the use of stochastic inversion methodologies to evaluate the most probable images within each scenario and to validate (or falsify) these scenarios against the known seismic reality.


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Chapter 26 Statistical Modeling of Regional and Worldwide Size-Frequency Distributions of Metal Deposits**

**Frits Agterberg**

**Abstract** Publicly available large metal deposit size data bases allow new kinds of statistical modeling of regional and worldwide metal resources. The two models most frequently used are lognormal size-grade and Pareto upper tail modeling. These two approaches can be combined with one another in applications of the Pareto-lognormal size-frequency distribution model. The six metals considered in this chapter are copper, zinc, lead, nickel, molybdenum and silver. The worldwide metal size-frequency distributions for these metals are similar indicating that a central, basic lognormal distribution is flanked by two Pareto distributions from which it is separated by upper and lower tail bridge functions. The lower tail Pareto distribution shows an excess of small deposits which are not economically important. Number frequencies of the upper tail Pareto are mostly less than those of the basic lognormal. Parameters of regional metal size-frequency distributions are probably less than those of the worldwide distributions. Uranium differs from other metals in that its worldwide size-frequency distribution is approximately lognormal. This may indicate that the lognormal model remains valid as a standard model of size-frequency distribution not only for uranium but also for the metals considered in this chapter, which are predominantly mined from hydrothermal and porphyry-type orebodies. A new version of the model of de Wijs may provide a framework for explaining differences between regional and worldwide distributions. The Pareto tails may reflect history of mining methods with bulk mining taking over from earlier methods in the 20th century. A new method of estimating the Pareto coefficients of the economically important upper tails of the metal size-frequency distributions is presented. A non-parametric method for long-term projection of future metal resource on the basis of past discovery trend is illustrated for copper.

**Keywords** Pareto-lognormal distribution ⋅ Size-frequency distributions ⋅ Worldwide metal resources ⋅ Future metal supply ⋅ Model of de Wijs

© The Author(s) 2018

F. Agterberg (✉)

Geological Survey of Canada, 601 Booth Street, Ottawa, ON K1A 0E8, Canada e-mail: frits.agterberg@canada.ca; frits@rogers.com

B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_26

### **26.1 Introduction**

Most models for regional or worldwide mineral or hydrocarbon resource appraisal assume either a lognormal or a Pareto model for the size-frequency distribution of the deposits considered. It can also be assumed that both models apply with the lognormal distribution providing a good fit to all sizes except for the smallest and largest deposits that satisfy fractal/multifractal Pareto distributions. The largest deposits obviously are rare and may be too few in number for adequate modeling in regional studies. However, recently, very large data bases have become available for metal deposits (Patiño Douce 2016a, b, c, 2017). In a newly proposed Pareto-lognormal model for worldwide metal deposit size-frequency distributions (Agterberg 2017a, b, in press), a basic lognormal distribution is flanked by two Pareto distributions. In this chapter this model is applied to copper, zinc, lead, nickel, molybdenum and silver. The upper and lower tail Pareto's are separated from the central lognormal by bridge functions to ensure continuity. An improved version of the Pareto-lognormal model will be applied to the upper tails of the size-frequency distributions for the six metals considered.

Previously, this approach was also applied to gold and uranium (Agterberg 2017b). For gold, the Pareto-lognormal model is not fully satisfied in that there is a shortage of gold deposits with sizes in the vicinity of the median of the worldwide gold size-frequency distribution. For uranium (size measured in tons of U3O8), a lognormal size-frequency distribution without Pareto tails provides a good fit. In the earlier publications (Agterberg 2017a, b, in press) comparisons were made between regional and worldwide size-frequency distributions for copper and gold. Logarithmic variances of worldwide size-frequency distributions exceed those of regional distributions and worldwide separate mineral deposit-type distributions. This observation also applies to the upper tail Pareto size-frequency distributions. A new variant of the model of de Wijs, to be discussed in more detail later in this chapter, can provide a partial explanation of the fact that the worldwide basic lognormal can be regarded as a mixture of regional lognormal distributions with parameters less than those of the worldwide basic lognormals and Pareto's. For example, within the Abitibi volcanic belt on the Precambrian Canadian Shield, the largest deposits for copper and gold satisfy Pareto size-frequency distributions with Pareto parameters (*α*Cu = 0.45; *α*Au = 0.88) that are less than those of their worldwide distributions (*α*Cu = 1.21; *α*Au = 1.16) illustrating that upper tail size parameter estimates for individual metal deposits are not stochastically independent data but subject to spatial correlation.

It should be pointed out that worldwide size-frequency distributions for some metals including copper (2541 deposits) are sufficiently large so that original data (without use of parametric statistical models) can be employed for long-term projections into the future at specific cut-off metal sizes (Agterberg 2017b; also see later in this chapter). Main emphasis in this chapter will be on size-frequency distribution modeling of the upper tail Pareto distribution and its transition into the basic lognormal. This is because total amount of metal in the lower tail of each Pareto-lognormal distribution is negligibly small. For example, 1340 copper deposits with greater than median size contain 99.7% of all copper in the complete data set of 2541 deposits so that information provided by the approximately 50% smaller deposits can be neglected (cf. Patiño Douce 2016c).

Patiño Douce (2016a, b, c, 2017) has published four important papers that are helpful in planning future metal supply; showing, for example, that for copper there would be a deficit of about 2.39 × 10<sup>9</sup> t (tonnes) by the end of this century if recent discovery rates are maintained. For comparison, according to the USGS Mineral Commodity Summaries (2015), proven copper reserves currently are 0.68 × 10<sup>9</sup> t. According to Patiño Douce (2017), current copper resources including the estimated reserves are 2.32 × 10<sup>9</sup> t whereas new demand by 2100 will be 4.70 × 10<sup>9</sup> t. Consequently, estimated future copper deficit is approximately equal to currently known copper resources. Using a non-parametric statistical method, this forecast was confirmed by Agterberg (2017b) who estimated copper resources to be discovered by the end of this century at 2.77 × 10<sup>9</sup> t with 95% confidence interval of ±0.994 × 10<sup>9</sup> that contains Patiño Douce's estimate (also see Sect. 26.5).
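The arithmetic behind this comparison can be retraced in a few lines. The sketch below only re-uses the figures quoted above (all in units of 10<sup>9</sup> t); the variable names are illustrative.

```python
# All quantities in 10^9 t of copper, as quoted in the preceding paragraph.
resources = 2.32        # current resources including reserves (Patiño Douce 2017)
demand_2100 = 4.70      # new demand by 2100 (Patiño Douce 2017)
deficit_quoted = 2.39   # deficit by the end of the century (Patiño Douce)
forecast = 2.77         # resources to be discovered (Agterberg 2017b)
half_width = 0.994      # half-width of the 95% confidence interval

deficit = demand_2100 - resources
interval = (forecast - half_width, forecast + half_width)
contains = interval[0] <= deficit_quoted <= interval[1]

print(f"demand minus resources = {deficit:.2f}")
print(f"95% interval = ({interval[0]:.3f}, {interval[1]:.3f}), contains quoted deficit: {contains}")
```

Demand minus resources comes to 2.38 × 10<sup>9</sup> t, which agrees with the quoted deficit to within rounding, and the quoted deficit lies inside the interval from 1.776 to 3.764 × 10<sup>9</sup> t.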

Patiño Douce (2016b) is accompanied by a supplementary database with sizes and grades for 20 metals. For example, his data on 2541 copper deposits were compiled from as many as 49 different sources. Patiño Douce (2016b) initially fitted lognormal distributions to the metal deposit size-frequency distributions in this data base pointing out that the logarithmic (base *e*) standard deviation ranges from about 2 to 3 for different metals, although average metal deposit sizes are greatly different. Both Patiño Douce (2016c) and Agterberg (2017a) showed that the largest deposits for different metals can be described by means of Pareto distributions. In the Pareto-lognormal metal size-frequency distribution model of Agterberg (2017a, b) the lognormal has a Pareto upper tail separated from the central lognormal by a bridge zone. This model recognizes both (1) lognormality of metal content of ore deposits from within smaller regions and those belonging to different mineral deposit types (see, e.g. Singer 2013), and (2) Pareto size-frequency distribution of the largest deposits but also for the economically unimportant smallest metal deposits that exhibit Pareto size-frequency distributions as well.

The Pareto-lognormal model for metal deposits provides an alternative to other size-frequency distribution models, which until about 30 years ago almost exclusively were based on the lognormal model. Mandelbrot (1983, p. 263) stated that oil and other natural resources have Pareto distributions and "this finding disagrees with the dominant opinion, that the quantities in question are lognormally distributed. This difference is extremely significant, the reserves being much higher under the hyperbolic than under the lognormal law." It will be seen in this chapter that size frequencies in the upper Pareto tails of the worldwide metal deposits taken for example are less than those of the basic lognormals when these are projected to the largest sizes. In this sense, the metal size frequency distributions are not "heavy-tailed". It can, however, be assumed that the Pareto represents a stable limiting form for the largest as well as the smallest metal deposits. Pareto size-frequency distribution modeling of the largest deposits has during the past 35 years been used by many authors including Drew et al. (1982) and Crovelli (1995) for oil and gas fields, and Cargill (1981), Cargill et al. (1980, 1981) and Turcotte (1997) for metal deposits. The latter author has developed a modification of the model of de Wijs (1951) that results in a Pareto instead of a lognormal distribution. Turcotte (1997) based this model on original publications by Cargill et al. (1980, 1981) and Cargill (1981) who had assumed power-law instead of lognormal models for U.S. mercury, lode gold and copper production. Like the lognormal, the Pareto-lognormal distribution is not universally applicable to all elements, which show bimodal or multimodal size-frequency distributions when all the many different rock bodies within the Earth's crust would be considered.

The fact that uranium has lognormal distribution without Pareto tails suggests that a multiplicative form of the central limit theorem is applicable for this metal and possibly for other metals in different kinds of mineral deposits as well. A new variant of the model of de Wijs described in the next section provides a partial explanation of the fact that the basic lognormal probably can be regarded as a mixture of regional lognormals with parameters that are less than those of the worldwide basic lognormal.

### **26.2 Modified Version of the Model of de Wijs Applied to Worldwide Metal Deposits**

In the original model of de Wijs (1951) for metal concentration values in blocks of rock, any block with metal concentration value *ζ* is repeatedly divided into halves with concentration values (1 + *d*) · *ζ* and (1 − *d*) · *ζ*, where *d* is the coefficient of dispersion, which is assumed to be independent of block size. The frequency distribution for metal concentration values in increasingly smaller blocks then satisfies the so-called logbinomial distribution, which rapidly approaches lognormal form. If there are *p* subdivisions, the logbinomial distribution of the concentration values of the resulting *n* = 2<sup>*p*</sup> blocks is

$$X(p,K) = \zeta \cdot (1+d)^{p-K}(1-d)^K$$

where *K* satisfies the binomial distribution with mean *μ*(*K*) = *p*/2 and variance *σ*<sup>2</sup>(*K*) = *p*/4 (cf. Agterberg 1974, p. 322). This logbinomial has *μ*(*X*) = *ζ* and variance *σ*<sup>2</sup>(*X*) approaching:

$$
\sigma^2(X) = \frac{p}{4} \cdot \left[ \ln \frac{1-d}{1+d} \right]^2
$$
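A short numerical check may help to fix ideas. The sketch below enumerates the *p* + 1 distinct values *X*(*p*, *K*) with their binomial weights and compares the resulting moments with the expressions above. It reads *σ*<sup>2</sup> as the variance of ln *X*, an interpretation made here, under which the closed form holds exactly because ln *X* is linear in *K*. The function names are illustrative; *d* = 0.7276 and *p* = 12 anticipate the copper application in the remainder of this section, and *ζ* = 1 is arbitrary.

```python
from math import comb, log

def logbinomial(zeta, d, p):
    """Distinct values X(p, K) and their probabilities for K = 0..p."""
    values = [zeta * (1 + d) ** (p - k) * (1 - d) ** k for k in range(p + 1)]
    probs = [comb(p, k) / 2 ** p for k in range(p + 1)]
    return values, probs

def weighted_moments(xs, ws):
    """Weighted mean and variance of a discrete distribution."""
    mean = sum(w * x for x, w in zip(xs, ws))
    var = sum(w * (x - mean) ** 2 for x, w in zip(xs, ws))
    return mean, var

zeta, d, p = 1.0, 0.7276, 12
values, probs = logbinomial(zeta, d, p)

mean_x, _ = weighted_moments(values, probs)
_, var_log_x = weighted_moments([log(v) for v in values], probs)
closed_form = p / 4 * log((1 - d) / (1 + d)) ** 2

print(f"mean of X          = {mean_x:.6f} (zeta = {zeta})")
print(f"variance of ln X   = {var_log_x:.6f}")
print(f"(p/4) [ln(..)]^2   = {closed_form:.6f}")
```

The mean of *X* equals *ζ* at every level of subdivision, which is the local preservation of average concentration that characterizes the original model.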

Various modifications of the original model of de Wijs (1951) were developed by Matheron (1962), Turcotte (1997) and Agterberg (2007). These modifications were primarily concerned with randomizing the model of de Wijs (e.g. in the random-cut model), spatial realizations to account for spatial autocorrelation, maximizing *p* (three-parameter model of de Wijs) and producing a Pareto tail (or other types of tail) on the logbinomial (e.g., as in the accelerated dispersion model of Agterberg (2007)). As discussed by Mandelbrot (1983), the model of de Wijs was the earliest example of a multifractal cascade. Lovejoy and Schertzer (2007) have pointed out that this original cascade is micro-canonical in that average metal concentration value is preserved locally at every cut. In universal multifractal theory these authors have generalized the cascade-type approach by preserving regionalized instead of strictly local averages. Their approach can result in a cascade that is largely lognormal but generates tails which are exactly Pareto-type. Here, another modified version of the original model of de Wijs (1951) is introduced as follows.

Suppose that the sizes of all deposits are combined with one another into a single very large block which is assigned to an arbitrary point in the upper part of the Earth's crust that contains metal deposits that have been or can be discovered. Suppose further that this block is divided into halves and the two smaller blocks are assigned to two points randomly located within halves of the upper part of the Earth's crust. This process can be repeated 2<sup>*p*</sup> times. At each step, the two resulting half-blocks of metal are further divided into halves that, after every cut, are randomly assigned to successively smaller segments of the upper Earth's crust. If there are *n* known deposits, the cascade process is repeated until *n* ≤ 2<sup>*p*</sup>. For example, in relatively well-known parts of the Earth's crust there occur 2541 copper deposits. Suppose that *p* = 12 so that the total number of subdivisions would be 4096. The 2541 copper deposits then can be regarded as a random subset of this larger population, so that the overall mean copper content value *ζ* and the coefficient of dispersion *d* can be estimated. From the parameters of the straight line representing the basic copper lognormal distribution (Fig. 26.2a, see later) it follows that the logarithmic (base *e*) mean and standard deviation are *μ* = 10.445 and *σ* = 3.1062. Consequently, *ζ* = exp(*μ* + *σ*<sup>2</sup>/2) = 4.277 × 10<sup>6</sup> t (4.277 Mt). It then follows that *d* = 0.7276.

"Observed" frequencies satisfying the log-binomial model are shown in Fig. 26.1. The best-fitting straight line (*y* = 0.755*x* − 3.8123) in this diagram has coefficients corresponding to mean *μ′* = 11.627 and standard deviation *σ′* = 3.050, which are relatively poor estimates in comparison to the values to be derived later for the basic lognormal for copper in Fig. 26.2a. The main reason for this minor discrepancy is the relatively strong influence on the best-fitting regression line of the logbinomial frequencies represented by the first and last points, which are for single blocks only. Positions of these two points illustrate that the logbinomial produces slightly weaker upper and lower tails in comparison with the lognormal. On the whole, the logbinomial very closely approximates the lognormal in this application.

**Fig. 26.1** Model of de Wijs applied to worldwide copper deposit size-frequency distribution. Overall mean set equal to *ζ* = 4.277 Mt copper; dispersion index *d* = 0.7276; number of subdivisions *p* = 12. "Observed" frequencies satisfy log-binomial model. Best-fitting straight line represents lognormal distribution. Logbinomial frequencies represented by first and last point are for single blocks only (*Source* Agterberg, in press)

The preceding model would allow for spatial autocorrelation of metal deposit size observations, which is known to exist. For example, the largest copper deposits are porphyry type and largely clustered in the Andes mountain chain of South America. On the other hand, the largest copper deposits in the Abitibi volcanic belt on the Canadian Shield are volcanogenic massive sulphide deposits which are smaller than the South American porphyry coppers. Because of the close resemblance of the logbinomial to the lognormal, preceding results also can be represented as follows. The characteristic function of a random variable *X* is:

$$g(u) = E(e^{iuX}) = \int_{-\infty}^{\infty} e^{iux} f(x)\,dx$$

where *f*(*x*) is the probability density function of *X*. Characteristic functions are discussed in statistical textbooks including Billingsley (1986) and Bickel and Doksum (2001). For a normal distribution:

$$g(u) = e^{i\mu u - \sigma^2 u^2 / 2}$$

If *Z*, with mean *μ<sub>z</sub>* and variance *σ<sub>z</sub>*<sup>2</sup>, represents the sum of two independent random variables *X* and *Y*, then the respective three characteristic functions satisfy:

$$g_z(u) = g_x(u) \cdot g_y(u).$$

**Fig. 26.2** Lognormal *Q*-*Q* plots for six metals (Cu, Zn, Pb, Ni, Mo and Ag). Coefficients of straight lines representing truncated lognormal distributions are shown in Table 26.1. Sample sizes are shown in Table 26.2. In each case, frequencies for the largest and smallest deposits deviate from the straight-line pattern indicating lower and higher number frequencies than expected on the basis of the lognormal size frequency distribution models represented by the straight lines

If *X* is normal with mean *μ<sub>x</sub>* and variance *σ<sub>x</sub>*<sup>2</sup>, and *Y* is normal as well with mean *μ<sub>y</sub>* and variance *σ<sub>y</sub>*<sup>2</sup>, then *Z* is normal with:

$$g_z(u) = e^{\left[i(\mu_x + \mu_y) \cdot u - (\sigma_x^2 + \sigma_y^2) \cdot u^2/2\right]}$$

Consequently, the probability density function of *Z* is:

$$f(z) = \frac{1}{\sqrt{\sigma_x^2 + \sigma_y^2} \cdot \sqrt{2\pi}} e^{-\left\{z - (\mu_x + \mu_y)\right\}^2 \cdot \left\{2 \cdot (\sigma_x^2 + \sigma_y^2)\right\}^{-1}}$$

Interpretation of this result in the context of worldwide metal deposits can be as follows. Suppose that log *Z* represents the basic lognormal metal deposit size-frequency distribution with logarithmic mean *μ<sub>z</sub>* = *μ<sub>x</sub>* + *μ<sub>y</sub>* and logarithmic variance *σ<sub>z</sub>*<sup>2</sup> = *σ<sub>x</sub>*<sup>2</sup> + *σ<sub>y</sub>*<sup>2</sup>. Then log *Z* can be regarded as a composite of many regional lognormal distributions with different means and lesser logarithmic variances, much as in the previous version of the model of de Wijs the overall logbinomial would consist of regional logbinomials with different parameters.
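This additivity can be illustrated with a small simulation. The sketch below is one possible reading of the composite interpretation, not a procedure taken from the source: each region receives its own logarithmic mean drawn from a normal distribution (the *X* component), and deposits within a region scatter around it with a smaller variance (the *Y* component). All parameter values, sample sizes and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

mu_x, sigma_x = 3.0, 1.0   # between-region component (illustrative values)
mu_y, sigma_y = 1.5, 0.9   # within-region component (illustrative values)
n_regions, n_per_region = 2000, 250

# One regional offset per region, shared by all deposits of that region.
regional_offset = rng.normal(mu_x, sigma_x, size=(n_regions, 1))
within_region = rng.normal(mu_y, sigma_y, size=(n_regions, n_per_region))
log_size = regional_offset + within_region

pooled_mean = log_size.mean()                # expected near mu_x + mu_y
pooled_var = log_size.var()                  # expected near sigma_x^2 + sigma_y^2
regional_var = log_size.var(axis=1).mean()   # expected near sigma_y^2 only

print(f"pooled mean      = {pooled_mean:.3f} (theory {mu_x + mu_y:.3f})")
print(f"pooled variance  = {pooled_var:.3f} (theory {sigma_x**2 + sigma_y**2:.3f})")
print(f"regional variance = {regional_var:.3f} (theory {sigma_y**2:.3f})")
```

In this setting the pooled logarithmic variance exceeds the average regional one by roughly the variance of the regional means, which is the pattern described in the text for worldwide versus regional distributions.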

### **26.3 Theory and Applications of the Pareto-Lognormal Model**

The cumulative frequency distribution for the Pareto-lognormal distribution *F*(*x*) = *F*(log *x*) can be written as

$$\begin{split} F(\log x) &\approx \Phi\left(\frac{\log x - \mu}{\sigma}\right) + H(\log x - \mu) \cdot B_1(\log x) \cdot (\log x - \mu)^{-\alpha} \\ &+ H(\mu - \log x) \cdot B_2(\log x) \cdot (\mu - \log x)^{-\kappa} \end{split}$$

where Φ((log *x* − *μ*)/*σ*) represents the basic lognormal (logs base 10). *H*(…) is the Heaviside function that applies to two filtered Pareto distributions, for positive and negative values of (log *x* − *μ*), respectively; it signifies that values at the other side of *μ* are set equal to zero when the equation is applied to either the upper tail or the lower tail of the Pareto-lognormal distribution. The bridge functions *B*<sub>1</sub>(log *x*) and *B*<sub>2</sub>(log *x*) span relatively short intervals between the basic lognormal and the Pareto distributions for the largest and smallest values, respectively. They satisfy lim<sub>*x*→∞</sub> *B*<sub>1</sub>(log *x*) = lim<sub>*x*→0</sub> *B*<sub>2</sub>(log *x*) = 1 and lim<sub>*x*→0</sub> *B*<sub>1</sub>(log *x*) = lim<sub>*x*→∞</sub> *B*<sub>2</sub>(log *x*) = 0.

The Pareto-lognormal probability density function *f*(log *x*) corresponding to *F*(log *x*) can be written as

$$\begin{split} f(\log x) &\approx \varphi\left(\frac{\log x - \mu}{\sigma}\right) + H(\log x - \mu) \cdot B_1'(\log x) \cdot (\log x - \mu)^{-\alpha-1} \\ &+ H(\mu - \log x) \cdot B_2'(\log x) \cdot (\mu - \log x)^{-\kappa-1} \end{split}$$

This density function may be useful for prediction of resources to be discovered in the future. The exponents in (log *x* − *μ*)<sup>−*α*−1</sup> and (*μ* − log *x*)<sup>−*κ*−1</sup> reflect the fact that the Pareto probability density function remains linear on a plot with logarithmic scales for both frequency and deposit size, but has a steeper dip.

The lognormal *QQ*-plot (logarithmic probability paper) provides a useful first step in fitting the Pareto-lognormal distribution. Figure 26.2 contains results for the six metals. Original data were taken from Patiño Douce (2016b). Each graph shows a straight-line pattern with departures from lognormality in the upper and lower frequency distribution tail. Relatively, there are too many smallest deposits and too few largest deposits. In the Pareto-lognormal model both the upper and lower tail distributions have transitions to the central lognormal that are gradual and described by the two bridge functions. For projections into the future (or for global downward projections into the Earth's crust) only the upper tails of the size-frequency distributions are of economic interest. In the next section, a new, relatively simple method will be described for fitting the upper tail Pareto distributions. The upper tail bridge function will be fitted empirically by connecting this Pareto to the central lognormal distribution. Copper can be used for illustrating details of the methods used. The straight line *y* = *bx* + *a* in Fig. 26.2a for copper represents the basic lognormal with coefficients *a* = −3.314 and *b* = 0.741 derived from the logarithmic mean *μ* = −*a*/*b* = 4.469 and standard deviation *σ* = 1/*b* = 1.349 of a truncated lognormal for which 10% (or 254 values) in both upper and lower tail were excluded from the sample of 2541 original copper deposit size values. The mean *μ* of this truncated distribution is only slightly different from 4.403 representing the logarithmic mean of all values. The basic lognormal standard deviation *σ* = 1.349 is slightly less than 1.423 representing the standard deviation based on all values because there are relatively many copper deposit size values in the lower tail. 
The value *σ* = 1.349 was obtained by dividing 0.893, the standard deviation of the truncated copper data set, by 0.662, a value taken from Johnson and Kotz (1970, Table 10, p. 84). Other published truncation correction factors were used for metals with wider upper or lower tails. Coefficients for all six straight lines shown in Fig. 26.2 are given in Table 26.1. The basic statistics estimated for all six metals shown in Table 26.2 were taken from Agterberg (2017a, b and in press) except for the upper tail Pareto coefficients with slightly different values newly derived by the method to be described in the next section.
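The numbers in this paragraph can be retraced with the standard library. The sketch below assumes that the correction factor 0.662 corresponds to the standard deviation of a standard normal distribution truncated symmetrically at its 10% and 90% quantiles; the closed-form expression used is the textbook result for a symmetrically truncated normal, not a formula quoted from Johnson and Kotz (1970). It also assumes that the *Q*-*Q* line *y* = *bx* + *a* relates log<sub>10</sub> size (*x*) to the standard normal quantile (*y*). Function names are illustrative.

```python
from math import log
from statistics import NormalDist

def truncated_sd_factor(tail_fraction):
    """SD of a standard normal after removing tail_fraction from each tail."""
    nd = NormalDist()
    z = nd.inv_cdf(1.0 - tail_fraction)      # truncation point, about 1.2816 for 10%
    kept = 1.0 - 2.0 * tail_fraction         # probability mass retained
    variance = 1.0 - 2.0 * z * nd.pdf(z) / kept
    return variance ** 0.5

def qq_line_to_params(a, b, natural_log=False):
    """Mean and SD implied by the line y = b*x + a; optionally converted to base e."""
    scale = log(10.0) if natural_log else 1.0
    return -a / b * scale, scale / b

factor = truncated_sd_factor(0.10)           # compare with 0.662
sigma = 0.893 / factor                       # compare with 1.349
slope = 1.0 / sigma                          # compare with b = 0.741
intercept = -4.469 * slope                   # compare with a = -3.314

# The line of Fig. 26.1, expressed in natural logarithms (compare 11.627 and 3.050).
mu_fig1, sigma_fig1 = qq_line_to_params(-3.8123, 0.755, natural_log=True)

print(f"factor = {factor:.4f}, sigma = {sigma:.4f}, b = {slope:.4f}, a = {intercept:.4f}")
print(f"Fig. 26.1 line: mu' = {mu_fig1:.3f}, sigma' = {sigma_fig1:.3f}")
```

Small differences in the last digit relative to the quoted coefficients are to be expected, since the inputs are themselves rounded.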


**Table 26.2** Comparison of basic statistics for eight metals including the six metals represented in Table 26.1 and Figs. 26.2, 26.3 and 26.4. N—number of deposits; Mt—million tons, t metric tons; LM, LS—logarithmic mean and standard deviation; *μ*, *σ*—ditto for truncated lognormal; *α*, *κ* upper and lower tail Pareto coefficients


### **26.4 Upper Tail Pareto Distribution and Its Connection to the Basic Lognormal Distribution**

The cumulative Pareto distribution function satisfies

$$F(x) = 1 - \left(\frac{k}{x}\right)^{\alpha}$$

where *α* > 0 and *k* > 0 are its two parameters. The following maximum likelihood estimator of the Pareto coefficient *α* has been used in several publications (Clauset et al. 2009; Patiño Douce 2016c; Agterberg 2017b) in various ways:

$$\alpha = \frac{n}{\sum_{i=1}^{n} \ln \frac{x_i}{k}}$$

where *n* represents the number of metal deposits selected in an ordered sequence of values *x<sub>i</sub>* (*i* = 1, 2, …, *n*), and *k* is the critical size parameter representing the truncation point at which the maximum value of the Pareto probability density drops to zero. In the original algorithm of Clauset et al. (2009), which was used by Patiño Douce (2016c), all possible values of *k* are tested for sizes *x*<sub>1</sub> &lt; *x*<sub>2</sub> &lt; *x*<sub>3</sub> … &lt; *x<sub>n</sub>*. Minimum size (*x*<sub>1</sub>) was set at median size and *x<sub>n</sub>* at maximum size. Each sample of *n* sizes provides a different estimate of *k* and *α*. The Kolmogorov-Smirnov test was used to find the Pareto distribution that provides the best fit.
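A minimal sketch of the estimator and of a threshold scan in the spirit of Clauset et al. (2009) follows. It is not their reference implementation: the function names, the synthetic sample and the `min_tail` safeguard are assumptions made for illustration, and the scan simply keeps the candidate *k* with the smallest Kolmogorov-Smirnov distance.

```python
import math
import random

def pareto_alpha_mle(sizes, k):
    """Maximum likelihood estimate of alpha from the sizes that are >= k."""
    tail = [x for x in sizes if x >= k]
    return len(tail) / sum(math.log(x / k) for x in tail)

def ks_distance(sizes, k, alpha):
    """Largest gap between the empirical CDF of the tail and the fitted Pareto CDF."""
    tail = sorted(x for x in sizes if x >= k)
    n = len(tail)
    gap = 0.0
    for i, x in enumerate(tail):
        model = 1.0 - (k / x) ** alpha
        # Compare against the empirical step just before and just after x.
        gap = max(gap, abs(model - i / n), abs(model - (i + 1) / n))
    return gap

def scan_thresholds(sizes, min_tail=10):
    """Try each observed size as k; keep the fit with the smallest KS distance."""
    ordered = sorted(sizes)
    best = None
    for k in ordered[: len(ordered) - min_tail]:
        alpha = pareto_alpha_mle(ordered, k)
        dist = ks_distance(ordered, k, alpha)
        if best is None or dist < best[0]:
            best = (dist, k, alpha)
    return best

# Synthetic Pareto sample with known parameters (illustrative values only).
random.seed(1)
alpha_true, k_true = 1.2, 1000.0
sample = [k_true * random.paretovariate(alpha_true) for _ in range(500)]

alpha_hat = pareto_alpha_mle(sample, k_true)
ks, k_best, alpha_best = scan_thresholds(sample)
print(f"alpha with known k: {alpha_hat:.3f}")
print(f"scan result: k = {k_best:.1f}, alpha = {alpha_best:.3f}, KS distance = {ks:.3f}")
```

To mimic the setting described above, the scan would be applied to the deposits at or above median size only, so that the smallest candidate *k* is the median.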

In Agterberg (2017b)'s application, *x*<sub>1</sub> &gt; *x*<sub>2</sub> &gt; *x*<sub>3</sub> … &gt; *x<sub>n</sub>* was used instead. This reversal of order was based on the following three premises: (1) worldwide metal deposit size sample sizes are very large ensuring that cumulative frequencies become increasingly precise when *n* is increased, regardless of whether or not the Pareto distribution model is satisfied; (2) starting with the largest deposits and increasing sample size by including progressively more deposits improves results if the Pareto distribution model would indeed be satisfied; and (3) for increasingly large values of *n*, observed frequencies become increasingly less than expected Pareto distribution model frequencies because the upper tail Pareto gradually passes into the lower frequency density basic lognormal via the upper tail bridge function. Theoretically, if *α* is known, *k* could be derived from *α* by using the preceding equation for the maximum likelihood estimator. In Agterberg (2017a, b, in press), *α* was pre-determined by visual inspection for 7 metals that all show approximately linear patterns in log rank—log size plots for their largest deposits.

Initially, for small values of *n*, the resulting patterns for copper and other metals show large random fluctuations. For larger values of *n* the plots develop multi-peak patterns for *α* that are superimposed on a gradational decrease. In Agterberg (2017b) a straight line was fitted by least squares for copper and gold avoiding the large small-sample fluctuations at the largest size values end capture the downward bend of log rank values toward lower log size values. This procedure produced estimates of *k*Cu = 6.98 and *k*Au = 8.98. Both estimates were confirmed by more detailed analysis of cumulative frequencies for largest deposits yielding *k*Cu ≈ 7.0 and *k*Au ≈ 9.2.

However, the preceding method does not work very well for some metals with fewer data than copper and gold. The following relatively simple method gave good results for six metals as shown in Figs. 26.3 and 26.4. The value of *n* was set equal to 20 in each application for a window that was slid along the series of ordered metal deposit size values from the largest deposit downward. Initial random fluctuations connected to the largest values were avoided and so were windows on the upper bridge function transition zone toward the basic lognormal size-frequency distribution. For copper this procedure gives *α* = −1.2059 for *k* = 6.996. The straight line with slope *α* passing through the point with average log size and average log rank for the 20 pairs of copper deposit size values used is shown in Fig. 26.3a. Similar results for the other five metals are shown in Figs. 26.3 and 26.4. According to the Pareto-lognormal model, a decrease of estimated values of *α* at the point where the upper tail Pareto ceases to be applicable is indeed expected. However, it is not clear why there is an equally strong decrease of estimated values of *α* in the patterns of Fig. 26.3 from the peak outward toward increasing values of log (deposit size). Very large random fluctuations are known to exist for the largest deposits. However, the upper tail downward trends in Fig. 26.3 could mean larger sizes than expected for the largest deposits although there are no indications of this in Fig. 26.4. Neither are there obvious deviations from linearity in log rank—log size plots that include the largest deposits for various metals (Agterberg 2017a, b). Residuals from the straight lines representing the Pareto distributions show relatively strong autocorrelation. Because of this uncertainty, it remains important to look for alternative upper tail models like the lognormals proposed by Patiño Douce (2016c, 2017) and shown for copper and gold in Agterberg (2017b). 
These alternate lognormals differ from the basic lognormals primarily in that they have much larger mean deposit size values.
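The sliding-window estimate described above can be sketched as follows. This is a minimal illustration, not the procedure used to produce Figs. 26.3 and 26.4: the function names and the synthetic input are assumptions, and the rule for choosing the optimum window (the peak of the *α* pattern) is not reproduced. Each window of *n* = 20 consecutive ordered deposits gets its own least-squares line on the log rank versus log size plot; such a line always passes through the point of average log size and average log rank.

```python
import math


def least_squares_slope(x, y):
    """Slope of the least-squares line of y on x (the line passes through the means)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def sliding_pareto_slopes(sizes, n=20):
    """Slope of log10(rank) versus log10(size) for each window of n ordered deposits.

    Windows start at the largest deposit and slide downward one deposit at a time.
    """
    ordered = sorted(sizes, reverse=True)
    log_size = [math.log10(s) for s in ordered]
    log_rank = [math.log10(r) for r in range(1, len(ordered) + 1)]
    return [
        least_squares_slope(log_size[i:i + n], log_rank[i:i + n])
        for i in range(len(ordered) - n + 1)
    ]


# Synthetic, exactly Pareto-ranked sizes (hypothetical data, not from any data base):
# rank = C * size**(-1.2), so every window should return a slope close to -1.2.
synthetic_sizes = [1e8 * r ** (-1 / 1.2) for r in range(1, 201)]
slopes = sliding_pareto_slopes(synthetic_sizes, n=20)
```

With real deposit sizes the slopes would vary from window to window, which is what the patterns in Fig. 26.3 display.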

In order to fully represent the upper tail cumulative size-frequency distribution, the Paretos have to be connected to the basic lognormals. Taking copper for example again, it can be seen in Fig. 26.2a that observed frequencies deviate from the best-fitting straight line for log Cu deposit size values greater than 6. In total 42 deposits have log Cu deposit size values greater than 7 and their observed cumulative frequency of 42/2524 can be used as an anchor point to connect the upper tail Pareto to the upper bridge function which represents the transition zone between the basic lognormal (for values < 6) and the Pareto (for values ≥ 7). Table 26.3 shows anchor points used for all six metals considered. Figure 26.5 shows best-fitting

**Fig. 26.3** Pareto coefficient (*α*) for log of metal deposit size as obtained in the text, setting *n* equal to 20 for overlapping data sets moving from larger to smaller log (deposit size) values. Maximum *α* will be taken as optimum value with data sets, on which it is based, for the six metals shown in Fig. 26.4

**Fig. 26.4** Sets of twenty log (metal deposit size) values corresponding to maximum value of *α* in Fig. 26.3 for the six metals. Corresponding Pareto distribution functions are shown as straight lines on these log rank—log size plots



**Fig. 26.5** Upper tails of Pareto-lognormal size-frequency distributions for the six metals constructed by using the method explained in Table 26.4. Upper tails bridge functions are smooth curves that satisfy quartic polynomials fitted by least squares to log size values satisfying basic lognormal on the left and upper tail Pareto on the right side. For copper the result does not differ significantly from sextic polynomial previously shown in Agterberg (2017b, Fig. 14). For molybdenum no bridge function was fitted. Points with log (Mo deposit size) ≤5 satisfy basic lognormal shown as straight line on Fig. 26.2e; points with log (Mo deposit size) ≥5 belong to the upper tail Pareto distribution

frequency distribution curves that are Pareto-type for log deposit size values exceeding the anchor points. Some anchor points slightly exceed the estimated values of the truncation parameters *k* without significantly changing the results. Quartic polynomials were used to approximate the smooth shapes of each frequency distribution within the upper tail bridge function that connects the Pareto with the basic lognormal. Table 26.4 shows results of this interpolation procedure for copper. The curve in Fig. 26.5a resembles the curve previously shown in Agterberg (2017a) where it was a best-fitting sextic polynomial. Contrary to the fitting of sextic polynomials to other metals, the method using a quartic explained in Table 26.4 gave good results for the other metals considered with the exception of molybdenum that does not seem to need a bridge function to pass from the Pareto into the basic lognormal. It is the only metal for which the upper tail Pareto and the central lognormal almost continuously pass into one another. Molybdenum, therefore, almost exactly satisfies the model proposed by Patiño Douce (2016b, Appendix 1) in which the probability density function of the lognormal as well as its first derivative pass continuously into the density function of the Pareto. The value at log (Mo deposit size) = 5 predicted by the basic lognormal is equal to the value of the Pareto at this point. Figure 26.5e, however, shows that there is a slight change of dip of the curve for log (1—cumulative frequency) at this point. All frequency distribution curves in Fig. 26.5 are close to their observed cumulative frequencies also shown in these diagrams.
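The bridge-function idea can be sketched as a least-squares quartic fitted to values taken from the lognormal on the left and from the Pareto line on the right, and then evaluated in between. The copper anchor point (42/2524 at log size 7) and the slope −1.2059 come from the text; the lognormal parameters `MU` and `SIGMA`, the grid of *x*-values, and all function names are hypothetical choices made only for illustration, so the numbers produced are not those of Table 26.4.

```python
import math


def fit_polynomial(xs, ys, degree=4):
    """Least-squares polynomial fit via the normal equations.

    Returns coefficients c[0..degree] of sum(c[j] * x**j).
    """
    m = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for k in range(col, m):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x ** j for j, c in enumerate(coeffs))


def lognormal_log_survival(x, mu, sigma):
    """log10 of P(log10 size > x) when log10 size is normal(mu, sigma)."""
    return math.log10(0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0))))


def pareto_log_survival(x, anchor_x, anchor_y, alpha):
    """Straight line through the anchor point with slope alpha."""
    return anchor_y + alpha * (x - anchor_x)


CENTER = 6.0                      # centre x to keep the normal equations well conditioned
MU, SIGMA = 4.7, 1.1              # hypothetical lognormal parameters (assumption)
ANCHOR_X = 7.0                    # anchor point for copper, from the text
ANCHOR_Y = math.log10(42 / 2524)  # log10 of observed cumulative frequency
ALPHA = -1.2059                   # Pareto slope for copper, from the text

left_x = [4.0 + 0.25 * i for i in range(5)]    # lognormal side, x <= 5
right_x = [7.0 + 0.25 * i for i in range(5)]   # Pareto side, x >= 7
xs = [x - CENTER for x in left_x + right_x]
ys = ([lognormal_log_survival(x, MU, SIGMA) for x in left_x]
      + [pareto_log_survival(x, ANCHOR_X, ANCHOR_Y, ALPHA) for x in right_x])

coeffs = fit_polynomial(xs, ys)
bridge = {x: evaluate(coeffs, x - CENTER) for x in (5.5, 6.0, 6.5)}
```

The quartic is only used between the two fitted ranges; outside them the lognormal and the Pareto line themselves apply.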

### **26.5 Prediction of Future Copper Resources**

As previously pointed out in Agterberg (2017b; in press), one of the purposes of developing statistical models of the size-frequency distributions of worldwide metal deposits is to use these models for prediction purposes either spatially (e.g., from relatively well-explored regions to unexplored regions, or deeper down from the Earth's surface), or in time. For multifractal modeling of the spatial distribution of mineral deposits, see Cheng (1994) or Cheng and Agterberg (1995). Use of parametric models is discussed by many authors including Agterberg (1974), Patiño Douce (2017) and Agterberg (2017b). The following non-parametric approach was first presented in the latter paper.

Suppose that *X* is a continuous random variable denoting mineral deposit size and that *K* is a discrete random variable for number of deposits per unit of area, volume or time; then the continuous random variable *Y* representing the sum of the sizes of the *K* deposits satisfies:

$$Y = X_1 + X_2 + \dots + X_K$$

**Table 26.4** Curve connecting smoothed *y*-values for log10 (1—cumulative frequency) in Fig. 26.5a in comparison with observed *y*-values for copper. Commencing as lognormal, the curve passes gradually into the straight line for its upper tail Pareto. Smoothed *y*-values for the intermediate bridge function satisfy a quartic polynomial equation fitted by least squares to lognormal values for *x* ≤ 5 and *x* ≥ 7. Smoothed values include quartic polynomial for *x* = 5 and *x* = 7


The mean *E*(*Y*) and variance *σ*<sup>2</sup> (*Y*) satisfy:

$$E(Y) = E(K) \cdot E(X); \quad \sigma^2(Y) = E(K) \cdot \sigma^2(X) + \sigma^2(K) \cdot E^2(X)$$

These equations were previously used in Agterberg (1974, Eq. 7.72) who had adopted them from Feller (1968, Chap. 12) where they are derived for *K* and *X* both representing integral-valued random variables. The approach also is applicable when *X* is a continuous random variable. The variance equation can be found in an online article on compound distributions (Lin 2014, Eq. (4)) with many additional references. Specific distribution models can be assumed to hold true for *K* and *X*. However, as shown earlier in this chapter, significant uncertainties remain in modeling the upper tail of worldwide metal size-frequency distributions that contain most metal. Fortunately, samples now available for statistical modeling are so large that the following non-parametric approach can be used.
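For readers who want to see where the two equations come from, a short derivation by conditioning on *K* is sketched here. It assumes, beyond what is stated explicitly in the text, that the sizes *X*<sub>1</sub>, *X*<sub>2</sub>, … are independent and identically distributed and independent of *K*:

$$E(Y) = E\left[E(Y \mid K)\right] = E\left[K \cdot E(X)\right] = E(K) \cdot E(X)$$

$$\sigma^2(Y) = E\left[\sigma^2(Y \mid K)\right] + \sigma^2\left[E(Y \mid K)\right] = E\left[K \cdot \sigma^2(X)\right] + \sigma^2\left[K \cdot E(X)\right] = E(K) \cdot \sigma^2(X) + \sigma^2(K) \cdot E^2(X)$$

The first term of the variance reflects variability in deposit size, the second reflects variability in the number of deposits.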

Patiño Douce (2017) contains tables with statistics based on the number of 1950–2007 copper deposit discoveries, originally derived from a plot by Schodde (2010) for copper deposits with size > 3 × 10<sup>5</sup> t Cu. Mean and variance of the yearly number of discoveries are 8.621 and 14.304, respectively. Extrapolation of these two parameters over 85 years, toward the end of this century, would yield an expected number of 732.8 discoveries with variance of 12.158 × 10<sup>3</sup>. Patiño Douce (2016b)'s original data base contains 591 copper deposits with sizes > 3 × 10<sup>5</sup> t Cu, resulting in estimated values of *E*(*X*) = 3.784 × 10<sup>6</sup> t and *σ*<sup>2</sup>(*X*) = 1.135 × 10<sup>14</sup>. Because of the large sample size, the 95% confidence limits on the estimated mean value are 3.784 × 10<sup>6</sup> ± 0.859 × 10<sup>6</sup> t, with the large sample ensuring approximate normality of the frequency distribution of this mean. Consequently, this estimate is rather precise. Using the preceding equations for mean *E*(*Y*) and variance *σ*<sup>2</sup>(*Y*), it follows that the estimated total copper tonnage amounts to 732.8 × 3.784 × 10<sup>6</sup> = 2.773 × 10<sup>9</sup> t. The corresponding variance amounts to 25.726 × 10<sup>16</sup>, from which it follows that the 95% confidence limits on the estimated mean value are 2.773 × 10<sup>9</sup> ± 0.994 × 10<sup>9</sup> t. This mean value is approximately normally distributed as well. Although the method for deriving this result differs significantly from the computer simulation method used by Patiño Douce (2017), the end result is only 0.654 × 10<sup>9</sup> t greater and the difference between the two estimates is not statistically significant. These results confirm Patiño Douce (2017)'s conclusion that there would be a significant shortage of copper if current rates of discovery are maintained. The problem would become even worse if future rates were to decrease.
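The arithmetic in this paragraph can be retraced with a few lines of Python. The sketch below simply plugs the values stated in the text into the two equations for *E*(*Y*) and *σ*<sup>2</sup>(*Y*); the function and variable names are illustrative assumptions, and the factor 1.96 is the usual normal approximation for 95% limits.

```python
import math


def compound_sum_moments(mean_k, var_k, mean_x, var_x):
    """Mean and variance of Y = X_1 + ... + X_K from the moments of K and X."""
    mean_y = mean_k * mean_x
    var_y = mean_k * var_x + var_k * mean_x ** 2
    return mean_y, var_y


# Values as stated in the text.
MEAN_K = 732.8        # expected number of discoveries over 85 years
VAR_K = 12.158e3      # variance of that number
MEAN_X = 3.784e6      # mean deposit size, t Cu
VAR_X = 1.135e14      # variance of deposit size
N_DEPOSITS = 591      # deposits used to estimate MEAN_X and VAR_X

# 95% half-width on the estimated mean deposit size.
half_width_x = 1.96 * math.sqrt(VAR_X / N_DEPOSITS)

# Total tonnage: mean, variance and 95% half-width.
mean_y, var_y = compound_sum_moments(MEAN_K, VAR_K, MEAN_X, VAR_X)
half_width_y = 1.96 * math.sqrt(var_y)

print(f"E(X) half-width: {half_width_x:.3e} t")
print(f"E(Y) = {mean_y:.3e} t, var(Y) = {var_y:.3e}, half-width = {half_width_y:.3e} t")
```

Most of the variance of the total comes from the second term, *σ*<sup>2</sup>(*K*) · *E*<sup>2</sup>(*X*), so the result is sensitive to the assumed variability in the number of discoveries.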

### **26.6 Concluding Remarks**

In this chapter it was argued that publicly available large metal deposit size data bases (especially Patiño Douce 2016b) allow new kinds of statistical modeling of regional and worldwide metal resources. The two models most frequently used in the past are lognormal size-grade and Pareto upper tail modeling. Both approaches are probably valid for several metals including copper, zinc, lead, nickel, molybdenum and silver, taken as examples, because the upper tails of their mostly lognormal size-frequency distributions satisfy the Pareto distribution model. The worldwide metal size-frequency distributions for these metals are similar, indicating that a central, basic lognormal distribution is flanked by two Pareto distributions from which it is separated by upper and lower tail bridge functions. The lower tail Pareto distribution shows an excess of small deposits which are not economically important. Number frequencies of the upper Pareto are mostly less than those of the basic lognormal. A new method for fitting the upper tail Pareto was introduced and produces good results for the six metals taken as examples. Parameters of regional metal size-frequency distributions as well as those of mineral deposit type distributions are less than those of the worldwide distributions. Uranium differs from other metals in that its worldwide size-frequency distribution is approximately lognormal. This may indicate that the lognormal model remains a standard model of size-frequency distributions of metals predominantly mined from hydrothermal and porphyry-type orebodies. A new version of the model of de Wijs may provide a framework for explaining the differences between regional and worldwide distributions. Further research on this topic remains to be carried out. The Pareto tails may reflect historical mining methods with bulk mining becoming prevalent in the 20th century. A new method of estimating the Pareto coefficients of the economically important upper tails of the size-frequency distributions was presented, and a non-parametric method for long-term projection of future metal resources on the basis of past discovery trends was illustrated for copper.

### **References**



### **Part IV Reviews**

### **Chapter 27 Bayesianism in the Geosciences**

**Jef Caers**

**Abstract** Bayesianism is currently one of the leading ways of scientific thinking. Due to its novelty, the paradigm still has many interpretations, in particular with regard to the notion of "prior distribution". In this chapter, Bayesianism is introduced within the historical context of the evolving notions of scientific reasoning such as inductionism, deduction, falsificationism and paradigms. From these notions, the current use of Bayesianism in the geosciences is elaborated from the viewpoint of uncertainty quantification, which has considerable relevance to practical applications of geosciences such as in oil/gas, groundwater, geothermal energy or contamination. The chapter concludes with some future perspectives on building realistic prior distributions for such applications.

### **27.1 Introduction**

Much of the research within the IAMG community involves developing tools for prediction: what is the grade? The volume of Oil in Place? The spatio-temporal changes of a contaminant plume? Making realistic predictions, meaning providing realistic uncertainty quantification, is key to making informed decisions. Decisions and their consequences are what matters in the end, not the kriging map of gold, or simulated permeability, or hydraulic conductivity. These are only intermediate steps to decision-making. In this chapter, I focus on a fundamental discussion on how we make predictions in the Geosciences and about the current leading paradigm: Bayesianism. This chapter is a revised version of the book "Quantifying Uncertainty in Subsurface Systems", Scheidt et al. Wiley Blackwell, 2018. The term UQ is therefore used for "Uncertainty Quantification".

Most of our applications involve three major components: data, a model and a decision. For example, in contaminant hydrology, we need to decide on a

J. Caers (✉)

Stanford University, Stanford, USA e-mail: jcaers@stanford.edu

<sup>©</sup> The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_27

remediation strategy or simply a decision to clean or not. We collect data: geochemical samples, geological studies, possibly even some geophysical surveys. We build models: a reactive transport model, a geostatistical model of spatial properties, a geochemical model. How does this all come together? Bayesian modeling is usually invoked as a way of integrating all these components. But what really constitutes "Bayesian" modeling? Thomas Bayes did not write Bayes' rule in the form we often see it in textbooks. However, after a long period of being mostly ignored in history, his idea of using a "prior" distribution heralded a new way of scientific reasoning which can be broadly classified as: Bayesianism. The aim of this chapter is to frame Bayesianism within the historical context of other forms of scientific reasoning such as induction, deduction, falsification, intuitionism and others. The application of Bayesianism is then discussed in the context of uncertainty quantification and specific to the Geosciences. This makes sense since quantifying uncertainty is about quantifying a lack of understanding or lack of knowledge. Science is all about creating knowledge. But then, what do we understand and what exactly is knowledge (the field of epistemology)? How can this ever be quantified with a consistent set of axioms and definitions, that is, if a mathematical approach is taken? Is such quantification unique? Is it rational at all to quantify uncertainty? Are we in agreement as to what Bayesianism really is?

These questions are not just practical questions aimed at engineering solutions; they lead to a deeper discussion around uncertainty. This discussion is philosophical, at the intersection of philosophy, science and mathematics: the science of studying knowledge and, as a result, uncertainty. In many papers published in journals that address uncertainty in subsurface systems, or in any system for that matter, philosophical views are rarely touched upon. Many such publications would start with "we take the Bayesian approach…" or "we take a fuzzy logic approach to…". But what does making this decision entail? Papers quickly become about algebra and calculus. Bayes or any other way of inferential reasoning is simply seen as a set of methodologies, technical tools and computer programs. The emphasis lies on the beauty of the calculus, solving the puzzle and improving "accuracy", rather than on any desire for a deeper understanding of what exactly one is quantifying. A pragmatic realist may state that in the end, the answer is provided by the computer codes, based on the developed calculus. Ultimately, everything is about bits and bytes and transistors amplifying or switching electronic signals; inputs and outputs. The debate is then which method is better, but such debate is only within the choices of the particular way of reasoning about uncertainty. That particular choice is rarely discussed. The paradigm is blindly accepted.

Bayes is like old medicine: we know how it works and what the side effects are, and it has been debated, tweaked, improved and discussed since Reverend Bayes' account was published by Price (1763). Our discussion will start with a general overview of the scientific method and the philosophy of science. This discussion will be useful in the sense that it will help introduce Bayesianism, as a way of inductive reasoning, compared to very different ways of reasoning. Bayes is popular, but not accepted by all (Earman 1992; Wang 2004; Gelman 2008; Klir 1994).

### **27.2 A Historical Perspective**

In the philosophy of science, fundamental questions are posed such as: what is a "law of nature"? How much evidence and what kind of evidence should we use to confirm a hypothesis? Can we ever confirm hypotheses as truths? What is truth? Why do we appear to rely on inaccurate theories (e.g. Newtonian physics) in the light of clear evidence that they are false and should be falsified? How do science and the scientific method work? What is science and what is not (the demarcation problem)? Associated with the philosophy of science are concepts such as epistemology (study of knowledge), empiricism (the importance of evidence), induction and deduction, parsimony, falsification and paradigm, all of which will be discussed in this chapter.

Aristotle (384–322 BC) is often considered to be the founder of both science and the philosophy of science. His work covers many areas such as physics, astronomy, psychology, biology, chemistry, mathematics and epistemology. Attempting to not solely be Euro-centric, one should also mention the scientist and philosopher Ibn al-Haytham (Alhazen), who could easily be called the inventor of the peer-review system, on which this chapter too relies. In the modern era, Galileo Galilei and Francis Bacon moved away from the Greek philosophy of favoring thought (rationality) over evidence (empiricism). Rationalism was continued by René Descartes. David Hume introduced the problem of induction. A synthesis of rationalism and empiricism was provided by Immanuel Kant. Logical positivism (Wittgenstein, Bertrand Russell, Carl Hempel) ruled much of the early twentieth century. For example, Bertrand Russell attempted to reduce all of mathematics to logic (logicism). Any scientific theory then requires a method of verification using a logic calculus in conjunction with the evidence, to prove such theory true or false. Karl Popper appeared on the scene as a reaction to this type of reasoning, replacing verifiability with falsifiability, meaning that for a method to be called scientific, it should be possible to construct an experiment or acquire evidence that can falsify it. More recently Thomas Kuhn (and later Imre Lakatos) rejected the idea that one method dominates science. They see the evolution of science through structures, programs and paradigms. Some philosophers such as Feyerabend go even further ("Against method", Feyerabend 1993), stating that no methodological rules really exist (or should exist).

The evolution of the philosophy of science has relevance to UQ. Simply replace the concept of "theory" with "model", and observations/evidence with data. There is much to learn from how people's viewpoints towards scientific discovery differ, how they have changed and how such change has affected our ways of quantifying uncertainty. One of the aims of this chapter therefore is to show that there is not really a single objective approach to uncertainty quantification based on some laws or rules provided by a passive, single entity (the truth-bearing clairvoyant God!). Uncertainty quantification, just like science, is dynamic; it relies on interaction between data, models and predictions and on evolving views on how these components interact. It is highly likely that few methods covered in this chapter will still be used in 100 years; just consider the history of science as evidence.

### **27.3 Science as Knowledge Derived from Facts, Data or Experience**

Science has gained considerable credibility, including in everyday life, because it is sold as "being derived from facts". It provides an air of authority, of truth to what are mainly uncertainties in daily life. This was basically the view with the birth of modern science in the seventeenth century. The philosophies that exalt this view are empiricism and positivism. Empiricism states that knowledge can only come from sensory experience. The common view was that (1) sensory experience produces facts to objective observers, (2) facts are prior to theories, and (3) facts are the only reliable basis for knowledge.

Empiricism is still very much alive in the daily practice of data collection, model building and uncertainty quantification. In fact, many scientists find UQ inherently "too subjective" and of lesser standing than "data", physical theories or numerical modeling. Many claim that decisions should be based merely on observations, not models.

*Seeing is believing*. "Data is objective, models are subjective". If facts are to be derived from sensory experience, mostly what we see, then consider Fig. 27.1. Most readers see a panel of squares, perhaps from a nice armoire. Others (very few) see circles and perhaps will interpret this as an abstract piece of art with interesting geometric patterns. Those who don't see circles at first simply need to look longer, with a different focus. Hence, there seems to be more than meets the eyeball (Hanson 1958). Consider another example in Fig. 27.2. What do you see? Most will recognize this as a section of a geophysical image (whether seismic, radar, etc.). A well-trained geophysicist will potentially observe a "bright spot" which may indicate the presence of a gas (methane, carbon dioxide) in the subsurface formations. A sedimentologist may observe deltaic formations consisting of channel stacks. Hence, the experience in viewing an object is highly dependent on the interpretation of the viewer and not the pure sensory light perceptions hitting one's retina. In fact, Fig. 27.2 is a modern abstract work of art by Mark Bradford (1963) on display in the San Francisco Museum of Modern Art (September 2016).

**Fig. 27.2** What do you see?

Anyone can be trained to make interpretations, and this is usually how education proceeds. Even pigeons can be trained to spot cancers as well as humans (Levenson et al., PLOS ONE, 18 November 2015; http://www.sciencemag.org/news/2015/11/pigeons-spot-cancer-well-human-experts). But this idea may also backfire. First off, the experts may not do better than random (Financial Times, March 31, 2013: "Monkey beats man on stock market picks", based on a study by the Cass Business School in London), or worse, produce cognitive biases, as pointed out by a study of the interpretation of seismic images (Bond et al. 2007).

*First facts, then theory*. Translated to our UQ realm as "first data, then models". Let's consider another example in Fig. 27.3, now with actual geophysical data and not a painting. A statement of fact would then be "this is a bright spot". Then, in the empiricist view, conclusions can be derived from it by deduction ("It contains gas").

**Fig. 27.3** Not art, just a geophysical image

However, what is relevant here is the person making this statement. A lay person will state as fact "There are squiggly lines". This shows that any observable fact is influenced by knowledge ("the theory") of the object of study. Statements of fact are therefore not simply recordings of visual perceptions. Additionally, quite an amount of knowledge is needed to consider taking the geophysical survey in the first place; hence facts do not precede theory. This is the case for the example here, but it is also a reality for many scientific discoveries (we need to know where to look). A more nuanced view therefore is that data and models interact with each other.

*Facts as the basis for knowledge*. "Data precedes the model". If facts depend on observers, resulting in statements that depend on such observers, and if such statements are inherently subjective, then can we trust data as a prerequisite to models (data precede models)? It is now clear that data does not come without a model itself, and hence if the wrong "data model" is used, then the data will be used to build incorrect models. "If I jump in the air and observe that I land on the same spot, then 'obviously' the Earth is not moving under my feet". Clearly the "data model" used here is lacking the concept (theory) of inertia. This again reinforces the idea that in modeling, and in particular UQ, data does not and should not precede the model, and that it is not the case that one is subjective and the other somehow is not.

### **27.4 The Role of Experiments—Data**

Progress in science is usually achieved by experimentation, the acquisition of information in a laboratory or field setting. Since "data" is central to uncertainty quantification, we spend some time on what "data" is, what "experiments" aim to achieve and what the pitfalls are in doing so.

First, the experiment is not without the "experimenter". Perceptual judgements may be unreliable, and hence such reliance needs to be minimized as much as possible. For example, in Fig. 27.4, the uninformed observer may notice that the moon is larger when on the horizon, compared to higher up in the sky, which is merely an optical illusion (for which there is still no consensus explanation). Observations are therefore said to be both objective and fallible. Objective in the sense that they are shared (in public, presentations, papers, online) and subject to further tests (such as measuring the actual moon size by means of instruments, revealing the optical illusion). Often progress happens when advances in the ways of testing or gathering data occur.

Believing that a certain acquisition of data will resolve all uncertainty and lead to determinism on which "objective" decisions can be based is an illusion, because the real world involves many kinds of physical/chemical/biological processes that cannot be captured by one way of experimentation. For example, performing a conservative tracer test, to reveal better hydraulic conductivity, may in fact be influenced by the reactions in the subsurface taking place while doing such an experiment. Hence the

**Fig. 27.4** The harvest moon appearing gigantic as compared to the moon in the high sky (https://commons.wikimedia.org/wiki/File:Harvest\_Moon\_over\_looking\_vineyards.jpg)

hydraulic conductivity measured and interpreted through some modeling without geochemical reactions may provide a false sense of certainty about the information deduced from such an experiment. In general, it is very difficult to isolate a specific target of investigation in the context of one type of experiment or data acquisition. A good example is in the interpretation of 4D geophysics (repeated geophysics). The idea of the repetition is to remove the influence of those properties that do not change in time, and therefore reveal only those that do change, for example, a change in pressure, a change in saturation, etc. However, many processes may be at work at the same time: a change in pressure, in saturation, rock compressibility, even porosity and permeability, geomechanical effects, etc. Hence someone interested in the movement of fluids (change in saturation) is left with a great deal of difficulty in unscrambling the time signature of geophysical sensing data. Furthermore, the inversion of data into a target of interest often ignores all these interacting effects. Therefore, it does not make sense to state that a pump test or a well test reveals permeability; it only reveals a pressure change under the conditions of the test and of the site in question, and many of these conditions may remain unknown or uncertain.

An issue that arises in experimentation is the possibility of a form of circular reasoning that may exist between an experimental set-up and a computer model aiming to reproduce the experimental set-up. If experiments are to be conducted to reveal something important about the subsurface (e.g. flow experiments in a lab), then often the results of such experiments are "validated" by a computer model. Is the physical/chemical/biological model implemented in the computer code derived from the experimental result, or, are the computer models used to judge the adequacy of the result? Do theories vindicate experiments and do experiments vindicate the stated theory? To study these issues better, we introduce the notion of induction and deduction.

### **27.5 Induction Versus Deduction**

Bayesianism is based on inductive logic (Howson 1991; Howson et al. 1993; Chalmers 1999; Jaynes 2003; Gelman et al. 2004), although some argue that it is based both on induction and deduction (Gelman and Shalizi 2013). Given the above consideration (and limitations) of experiments (in a scientific context) and data (in a UQ context), the question now arises on how to derive theories from these observations. Scientific experimentation, modeling and studies often rely on a logic to make certain claims. Induction and deduction are such kinds of logic. What such logic offers is a connection between premises and conclusions:


This logical deduction is obvious, but such logic only establishes a connection between premises 1 and 2 and the conclusion 3, it does not establish the truth of any of these statements. If that would be the case, then also:


is equally "logical". The broader question therefore is whether scientific theories can be derived from observations. The same question occurs in the context of UQ: can models be derived from data? Consider a lab doing a set of *n* experiments.

Premises:

1. The reservoir rock is water-wet in sample 1.
2. The reservoir rock is water-wet in sample 2.

…

20. The reservoir rock is water-wet in sample 20.

Conclusion: the reservoir is water-wet (and hence not oil-wet).

This simple idea is mimicked from Bertrand Russell's turkey argument (in his case it was a chicken). "I (the turkey) am fed at 9 am" day after day, hence "I am always fed at 9 am", until the day before Thanksgiving (Chalmers 1999). Another form of induction occurred in 1907: "But in all my experience, I have never been in any accident … of any sort worth speaking about. I have seen but one vessel in distress in all my years at sea. I never saw a wreck and never have been wrecked nor was I ever in any predicament that threatened to end in disaster of any sort. (E. J. Smith 1907, Captain, RMS Titanic)".

Any model or theory derived from observations can never be proven in the sense of being logically derived from them (David Hume).

This does not mean that induction (deriving models from observations) is completely useless. Some inductions are more warranted than others, specifically when the set of observations is "large" and performed under a "wide variety of conditions", although these qualitative statements clearly depend on the specific case. "When I swim with hungry sharks, I get bitten" really needs to be asserted only once.

The second qualification (variety of conditions) requires some elaboration because we will return to it when discussing Bayesianism. Which conditions are being tested is important (the age of the driller, for example, is not), hence in doing so we rely on some prior knowledge of the particular model or theory being derived. Such prior knowledge will determine which factors will be studied, which are influencing the theory/model and which are not. Hence the question is how this "prior knowledge" itself is asserted by observations. One runs into the never-ending chain of what prior knowledge is used to derive prior knowledge. This point was made clear by David Hume, an eighteenth-century Scottish philosopher (Hume 2000, originally 1739). Often the principle of induction is argued for because it has "worked" from experience. The reader simply needs to replace the example of the water-wet rocks with "Induction has worked in case *j*", etc., to understand that induction is, in this way, "proven" by means of induction. The way out of this "mess" is to not make true/false statements, but to use induction in a probabilistic sense (probably true), a point to which we will return when addressing Bayesianism.
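To make "probably true" concrete for the water-wet example, one simple option is Laplace's rule of succession. The sketch below is an illustration only and rests on assumptions not made in the text: the samples are treated as exchangeable, and the unknown fraction of water-wet rock gets a uniform Beta(1, 1) prior. The function name and parameters are hypothetical.

```python
from fractions import Fraction


def probability_next_is_water_wet(n_water_wet, n_samples, prior_a=1, prior_b=1):
    """Posterior predictive probability that the next sample is water-wet.

    Assumes exchangeable samples and a Beta(prior_a, prior_b) prior on the
    unknown fraction of water-wet rock (uniform when both equal 1).
    """
    return Fraction(n_water_wet + prior_a, n_samples + prior_a + prior_b)


# All 20 samples were water-wet: the next one is probably, not certainly, water-wet.
p_next = probability_next_is_water_wet(20, 20)
print(p_next, float(p_next))
```

The probability approaches but never reaches 1 as samples accumulate, and a different prior would give a different number, which is exactly where the question of prior knowledge re-enters.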

### **27.6 Falsificationism**

### **A Reaction to Induction**

Falsificationism, as championed by Karl Popper (1959) starting in the 1920s, was born partly as a reaction to inductionism (and logical positivism). Popper claimed that science should not involve any induction (theories derived from observations). Instead, theories are seen as speculative or tentative, as created by the human intellect, usually to overcome limitations of previous theories. Once stated, such theories need to be tested rigorously with observations. Theories that are inconsistent with such observations should be rejected (falsified). The theories that survive are the best theories, currently. Hence, falsificationism has a time component and aims to describe progress in science, where new theories are born out of old ones by a process of falsification.

In terms of UQ, one can then see models not as true representations of actual reality but as hypotheses. One has as many hypotheses as models. Such a hypothesis can be constrained by previous knowledge, but real field data should be used not to confirm a model (the model conforms with the data) but to falsify a model (reject it, because the model does not conform with the data). A simple example illustrates the difference:

*Induction*: Premise: All rock samples are sandstones. Conclusion: The subsurface system contains only sandstone.

*Falsification*: Premise: A sample has been observed that is shale. Conclusion: The subsurface system does not consist just of sandstone.

The latter is clearly a logically valid deduction (true). Falsification therefore can only proceed with hypotheses that are falsifiable (this does not mean that one has to falsify the observations, but that such an observation could exist). Some hypotheses are not falsifiable; for example, "the subsurface system consists of rocks that are sandstone or not sandstone". This then raises the question of the degree of falsifiability of a hypothesis and the strength (precision) of the observation in falsifying. Not all hypotheses are equally falsifiable and not all observations should be treated on the same footing. A strong hypothesis is one that makes strong claims; there is a difference between:


Clearly 2 has more consequences than 1. Falsification therefore invites stating bold conjectures rather than safe conjectures. Science advances through a large number of bold conjectures that would be easily falsifiable. As a result, a hypothesis *B* that is offered after hypothesis *A* should also be more falsifiable.

The latter has considerable implications in UQ and model building. Inductionists tend to bet on one model, the best possible, best explaining most observations, within a static context, without the idea that the model they are building will evolve. Inductionists do evolve models, but that is not the outset of their viewpoint; there is always the hope that the best possible model will remain the best possible. The problem with this inductionist attitude is that new observations that cannot be fitted into the current model are used to "fix" the model with ad hoc modifications. A great example of this can be found in the largest oil reservoir in the world, namely the Ghawar field (see *Twilight in the Desert: The Coming Saudi Oil Shock and the World Economy*, Matt Simmons). Before 2000, most modelers (geologists, geophysicists, engineers) did not consider fractures as being a driving heterogeneity for oil production. However, flow meter observations in wells indicated significant permeability. To account for this data, the existing models with already large permeabilities (1000–10,000 mD) were modified to 200 D, see Fig. 27.5. While this dramatic increase in permeability in certain zones did lead to explaining the flow meter data, the ad hoc modification cannot be properly tested with the current observations. It is just a fix to the model (the current "theory" of no fractures). Instead, a new test would be needed, such as new drilling to confirm or not the presence of a gigantic cave that can explain such ridiculous permeability values. Today, all models built of the Ghawar field contain fractures.

**Fig. 27.5** A reservoir model developed to reflect super permeability channels; note the legend with permeability values (Valle et al. 1993)

Falsificationism does not use ad hoc modification, because the ad hoc modification cannot be falsified. In the Ghawar case, the very notion of fluid flow by means of large matrix permeability tells the falsificationist that bold alternative modifications to the theory are needed and not simple ad hoc fixes, in the same sense that science does not progress by means of fixes. An alternative to the inductionist approach in Ghawar could therefore be as follows: most fluid flow is caused by large permeability, except in some areas where it is hypothesized that fractures are present despite the fact that we have not directly observed them. The falsificationist will now proceed by finding the most rigorous (new) test of this hypothesis. This could consist of acquiring geomechanical studies of the system (something different from flow) or geophysical data that aim to detect fractures (AVOZ data). New hypotheses also need to lead to new tests that can falsify them. This is how progress occurs. The problem is often "time"; a falsificationist takes the path of high risk, high gain, but time may run out on doing experiments that falsify certain hypotheses. "Failures" are often seen as that and not as lessons learned. In the modeling world one often shies away from bold hypotheses (certainly if one wants to obtain government research funding!), and modelers, as a group, tend to gravitate towards some consensus under the banner of being good at "team-work". It is the view of the authors that such practice is, however, the death of any realistic UQ. UQ needs to include bold hypotheses, model conjectures that are not the norm, not based on any majority vote, and not arrived at by playing it safe or being conservative. Uncertainty cannot be reduced by just great team-work; it will require equally rigorous observations (data) that can falsify any (preferably bold) hypothesis.

This does not mean that inductionist type of modeling and falsification type of modeling cannot co-exist. Inductionism leads to cautious conjectures and falsification leads to bold conjectures. Cautious conjectures may carry little risk, and hence, if they are falsified, then insignificant advance is made. Similarly, if bold conjectures cannot be falsified with new observations, significant advance is made. The matter that is important in all this however is the nature of the background knowledge (recall, the prior knowledge), that is, what is currently known about what is being studied. Any "bold" hypothesis is measured against such background knowledge. Likewise, the degree to which observations can falsify hypotheses needs to be measured against such knowledge. This background knowledge changes over time (what is bold in 2000 may no longer be bold in 2020), and such change, as we will discuss, is explicitly modeled in Bayesianism.

### **Falsificationism in Statistics**

Schools of statistical inference are sometimes linked to the falsificationist views of science, in particular the work of Fisher, Neyman and Pearson, all well-known scientists in the field of (frequentist) statistics (see Fisher and Fisher 1915; Fisher 1925; Rao 1992; Pearson et al. 1994; Berger 2003; Fallis 2013 for overviews and original papers). Significance tests, confidence intervals and *p*-values are associated with a hypothetico-deductive way of reasoning. Since these methods are pervasive in all areas of science, particularly in UQ, we present some discussion on their rationality as well as the opposing views of inductionism within this context.

Historically, Fisher can be seen as the founder of classical statistics. His work has a falsificationist foundation, steeped in statistical "objectivity" (lack of necessary subjective assumption, which is the norm in Bayesian methods). The now well-known procedure starts by stating a null hypothesis (a coin is fair), then defines an experiment (flipping), a stopping rule (e.g. number of flips) and a test statistic (e.g. number of heads). Next, the sampling distribution (the probability of each possible value of the test statistic), assuming the null hypothesis is true, is calculated. Then, we calculate the probability *p* that our experiment falls in an extreme group (e.g. 4 heads or fewer, which under the null hypothesis has a probability of only about 1.2% for 20 flips when the mirror-image outcomes of 4 tails or fewer are counted as equally extreme). Then a convention is taken to reject (falsify) the hypothesis when the experiment falls in the extreme group, say *p* ≤ 0.05.
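The coin procedure described above can be sketched numerically. The following is a minimal illustration, not a reference implementation; the variable names and the choice of a two-sided extreme group are assumptions of this sketch. It suggests that the quoted figure of about 1.2% corresponds to counting both tails (4 heads or fewer, or 16 heads or more), whereas the lower tail alone has a probability of about 0.6%.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n independent flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_flips = 20   # stopping rule: a fixed number of flips
p_null = 0.5   # null hypothesis: the coin is fair

# Sampling distribution of the test statistic (number of heads) under the null.
sampling = [binomial_pmf(k, n_flips, p_null) for k in range(n_flips + 1)]

# Extreme group: 4 heads or fewer and, by symmetry, 16 heads or more.
p_lower = sum(sampling[:5])
p_two_sided = p_lower + sum(sampling[16:])

alpha = 0.05   # conventional rejection threshold
reject_null = p_two_sided <= alpha

print(f"P(heads <= 4)          = {p_lower:.4f}")
print(f"P(heads <= 4 or >= 16) = {p_two_sided:.4f}")
print(f"reject fair-coin hypothesis: {reject_null}")
```

Changing the stopping rule or the definition of the extreme group changes *p*, which is one concrete way in which choices made by the analyst enter an apparently objective procedure.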

Fisher's test works only on isolated hypotheses, which is not how science progresses; often many competing hypotheses are proposed that require testing under some evidence. Neyman and Pearson developed statistical methods that involve rival hypotheses, but again reasoning from an "objective" perspective, without relying on priors or posteriors of Bayesian inductive reasoning. For example, in the case of two competing hypotheses *H*<sub>1</sub> and *H*<sub>2</sub>, Neyman and Pearson reasoned that each hypothesis is either accepted or rejected, leading to two kinds of errors (rejecting a hypothesis that is in fact true, or accepting one that is in fact false), better known as type I and type II errors. Neyman and Pearson improved on Fisher in better defining "low probability". In the coin example, a priori, any particular sequence of 20 tosses has a probability of 2<sup>−20</sup>; even under a fair coin, every individual outcome has a small probability. Neyman and Pearson provide a sharper definition of the critical region (where hypotheses are rejected). If *X* is the random variable describing the outcome (e.g. a combination of tosses), then the critical region is defined by the following inequality:

$$L(X) = \frac{P(X \mid H_1)}{P(X \mid H_2)} \le \delta, \quad P(L(X) \le \delta \mid H_1) = \alpha \tag{27.1}$$

with *δ* depending on the significance level *α* and the nature of the hypotheses. This theorem, known as the Fundamental Lemma (Neyman and Pearson 1933), defines the most powerful test to reject *H*<sub>1</sub> in favor of *H*<sub>2</sub> at significance level *α* for a threshold *δ*. The likelihood ratio was later interpreted by Bayesianists as the Bayes' factor (the evidential force of the evidence). This was however not the interpretation of Neyman-Pearson, who rejected subjective models.
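As a hedged illustration of Eq. (27.1), the sketch below builds a likelihood-ratio critical region for the 20-flip coin example. The rival hypothesis (a coin with a probability of heads of 0.2) and all variable names are assumptions introduced here, not taken from the text. Because the number of heads is discrete, the region generally has a size below *α* rather than exactly *α*; the sketch does not use the randomized test that would close that gap.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n = 20
p_h1 = 0.5    # H1: fair coin (the hypothesis under test)
p_h2 = 0.2    # H2: hypothetical rival, a coin biased towards tails
alpha = 0.05

# Likelihood ratio L(k) = P(k | H1) / P(k | H2) for each number of heads k.
ratio = [binomial_pmf(k, n, p_h1) / binomial_pmf(k, n, p_h2) for k in range(n + 1)]

# Add outcomes to the critical region in order of increasing L(k), i.e. those
# least supportive of H1 first, while the probability under H1 stays <= alpha.
order = sorted(range(n + 1), key=lambda k: ratio[k])
critical, size = [], 0.0
for k in order:
    prob = binomial_pmf(k, n, p_h1)
    if size + prob > alpha:
        break
    critical.append(k)
    size += prob

delta = max(ratio[k] for k in critical)                   # threshold on L
power = sum(binomial_pmf(k, n, p_h2) for k in critical)   # P(reject H1 | H2)

print("critical region (number of heads):", sorted(critical))
print(f"delta = {delta:.4g}, size = {size:.4f}, power = {power:.4f}")
```

With a different rival hypothesis the ordering of outcomes by likelihood ratio, and hence the critical region, can change, which is the sense in which *δ* depends on the nature of the hypotheses.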

What then does a significance test tell us about the truth (or not) of a hypothesis? Since the reasoning here is in terms of falsification (and not induction), the Neyman-Pearson interpretation is that if a hypothesis is rejected, then "one's actions should be guided by the assumption that it is false" (Lindgren 1976). Neyman-Pearson gladly admit that significance tests tell nothing about whether a hypothesis is true or not. However, they do attach the notion of "in the long run", interpreting the significance level as, for example, the number of times in 1000 times that the same test is being done. The problem here is that no testing can be done and will be done in exactly the same fashion, under the exact same circumstances. This idea would also invoke the notion that under a significance level of 0.05, a *true* hypothesis would be rejected with a probability of 0.05. The latter violates the very reason on which significance tests were formed: events with probability *p* can never be proven to occur (that requires subjectivity!), let alone with the exact frequency of *p*.

The point here is to show that classical statistics should not be seen as purely falsificationist, a logical hypothetico-deductive way of reasoning. Reasoning in classical statistics comes with its own subjective notions of personal judgement (choosing which hypothesis, what significance level, stopping rules, critical regions, iid assumptions, Gaussian assumptions, etc.). This was in fact later acknowledged by Pearson himself (Neyman and Pearson 1967, p. 277).

### **Limitations of Falsificationism**

Falsificationism comes with its own limitations. Just as induction cannot be induced, falsificationism cannot be falsified, as a theory. This becomes clearer when considering real-world development of models or theories. The first problem is similar to the one discussed in using inductive and deductive logic. Logic only works if the premises are true; hence falsification, as a deductive logic, cannot distinguish between a faulty observation and a faulty hypothesis. The hypothesis does not have to be false when inconsistent with observations, since observations can be false. This is an important problem in UQ that we will revisit later.

The real world involves considerably more complication than "the subsurface system is deltaic". Let's return to our example of monitoring heat storage using geophysics. A problem that is important in this context is to monitor whether the heat plume remains near the well and is compact, so that it does not start to disperse, since then recovery of that heat becomes less efficient. A hypothesis could then be "the heat plume is compact"; geophysical data can be used to falsify this by, for example, observing that the heat plume is indeed influenced by heterogeneity. Unfortunately, such data does not directly observe "temperature"; instead it measures resistivity, which is related to temperature and other factors. Additionally, because monitoring is done at a distance from the plume (at the surface), the issue of limited resolution occurs (any "remote sensing" suffers from this limited resolution). This is then manifested in the inversions of the ERT data into temperature, since many inversion techniques result in smooth versions of actual reality (due to this limited resolution issue), from which the modeler may deduce that homogeneity of the plume is not falsified. How do we find where the error lies? In the instrumentation? In the instrumentation set-up? In the initial and boundary conditions that are required to model the geophysics? In the assumptions about geological variability? In the smoothness of the inversion? Falsification does not provide a direct answer to this. In science, this problem is better known as the Duhem–Quine thesis after Pierre Duhem and Willard Quine (Ariew 1984). This thesis states that it is impossible to falsify a scientific hypothesis in isolation, because the observations required for such falsification themselves rely on additional assumptions (hypotheses) that cannot be falsified separately from the target hypothesis (or vice versa). Any particular statistical method that claims to do so ignores the physical reality of the problem.

A practical way to deal with this situation is to consider not just falsification, but sensitivity to falsification. What impacts the falsification process? Sensitivity analysis, even with limited or approximate physical models, provides more information that can lead to (1) changing the way data is acquired (the "value of information") and (2) changing the way the physics of the problem (e.g. the observations) is modeled, by focusing on what matters most towards testing the hypothesis.

More broadly, falsification does not really follow the history of the scientific method. Most science has not been developed by means of bold hypotheses that are then falsified. Instead, theories that are falsified are carried through history, most notably because observations that appear to falsify the theory can be explained by means of causes other than the theory that was the aim of falsification. This is quite common in modeling too: observations are used as claims that a specific physical model does not apply, only to discover at a later time that the physical model was correct but that the data could be explained by some other factor (e.g. a biological reason, instead of a physical reason). Popper himself acknowledged this dogmatism (hanging onto models that have been "falsified" to "some degree"). As we will see later, one of the problems in the application of probability (and Bayesianism) is that zero probability models are deemed "certain" not to occur. This may not reflect the actual reality that models falsified under such Popper-Bayes philosophy become "unfalsified" later by new discoveries and new data. Probability and "Bayesianism" are not at fault here, but the all too common underestimation of uncertainties in many applications.
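The remark about zero-probability models can be made concrete with a small numerical sketch. The three models, their priors and the likelihood values below are entirely hypothetical and chosen only for illustration. Under Bayes' rule, a model assigned a prior of exactly zero stays at zero no matter how strongly later data favor it, whereas a model given even a small non-zero prior can recover.

```python
def bayes_update(prior, likelihood):
    """Posterior model probabilities from prior probabilities and data likelihoods."""
    unnormalized = [p * l for p, l in zip(prior, likelihood)]
    total = sum(unnormalized)
    if total == 0:
        raise ValueError("all models have zero posterior weight")
    return [u / total for u in unnormalized]

def run(prior, data_likelihoods):
    """Apply successive Bayesian updates, one per new data set."""
    posterior = list(prior)
    for likelihood in data_likelihoods:
        posterior = bayes_update(posterior, likelihood)
    return posterior

# Likelihood of three successive (hypothetical) data sets under models A, B, C.
# The new data increasingly favor model C.
data_likelihoods = [
    [0.30, 0.20, 0.90],
    [0.05, 0.10, 0.95],
    [0.01, 0.02, 0.99],
]

# Model C "falsified" outright (prior of exactly zero) versus merely doubted.
posterior_zero = run([0.60, 0.40, 0.00], data_likelihoods)
posterior_small = run([0.59, 0.40, 0.01], data_likelihoods)

print("C given zero prior: ", [round(p, 3) for p in posterior_zero])
print("C given small prior:", [round(p, 3) for p in posterior_small])
```

The sketch is consistent with the point made above: the arithmetic of Bayes' rule is not the problem, but assigning zero probability amounts to claiming a certainty that is rarely justified.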

### **27.7 Paradigms**

### **Thomas Kuhn**

From the previous presentation, one may argue that both induction and falsification provide too fragmented a view of the development of scientific theories or methods, one that often does not agree with reality. Thomas Kuhn, in his book *The Structure of Scientific Revolutions* (Kuhn 1996), emphasizes the revolutionary character of scientific methods. During such a revolution one abandons one "theoretical" concept for another, which is incompatible with the previous one. In addition, the role of scientific communities is more clearly analyzed. Kuhn describes the following evolution of science:

paradigm → crisis → revolution → new paradigm → new crisis.

Such a single paradigm consists of certain (theoretical) assumptions, laws, methodologies and applications adopted by members of a scientific community. Probabilistic methods, or Bayesian methods, can be seen as such paradigms: they rely on axioms of probability and the definition of a conditional probability, the use of prior information, subjective beliefs, maximum entropy, the principle of indifference, algorithms of McMC, etc. Researchers within this paradigm do not question the fundamentals of such a paradigm, its fundamental laws or axioms. Activities within the paradigm are then puzzle-solving activities (e.g. studying convergence of a Markov chain) governed by the rules of the paradigm. Researchers within the paradigm do not criticize the paradigm. It is also typical that many researchers within that paradigm are unaware of the criticism of the paradigm or ignorant as to the exact nature of the paradigm, simply because it is a given: who is really critical of the axioms of probability when developing Markov chain samplers? Or, who questions the notion of conditional probability when performing stochastic inversions? Puzzles that cannot be solved are deemed to be anomalies, often attributed to the lack of understanding of the community about how to solve the puzzle within the paradigm, rather than a question about the paradigm itself. Kuhn considers such unsolved issues as anomalies rather than what Popper would see as potential falsifications of the paradigm. Greater awareness and articulation of the assumptions of a single paradigm becomes necessary when the paradigm requires defending against offered alternatives.

Within the context of UQ, a few such paradigms have emerged reflecting the concept of revolution as Kuhn describes. The most "traditional" of paradigms for quantifying uncertainty is by means of probability theory and its extension of Bayesian probability theory (the addition of a definition of conditioning). We provide here a summary account of the evolution of this paradigm, the criticism leveled, the counter-arguments and the alternatives proposed, in particular possibility theory.

### **Is Probability Theory the Only Paradigm for Uncertainty Quantification?**

#### **The Axioms of Probability: Kolmogorov–Cox**

The concept of numerical probability emerged in the mid-seventeenth century. A proper formalization was developed by Kolmogoroff (1950) based on classical measure theory. A comprehensive study of its foundations is offered in Fine (1973). The treatment is vast and comprises many works of particular note (Gnedenko et al. 1962; Fine 1973; de Finetti 1974, 1995; de Finetti et al. 1975; Jaynes 2003; Feller 2008). Also of note is the work of Shannon (1948) on uncertainty-based information in probability. In other words, the concept of probability has been around for more than three centuries. What is probability? It is now generally agreed (the fundamentals of the paradigm) that the axioms of Kolmogorov form the basis, as well as the Bayesian interpretation by Cox (1946). Since most readers are unfamiliar with the Cox theorem and the consequences for interpreting probability, we provide some high-level insight.

Cox works from a set of postulates, for example (we focus on just two of the three postulates):


$$\operatorname{plaus}(p \land q) = f(\operatorname{plaus}(p), \operatorname{plaus}(q \mid p)).$$

The traditional laws are recovered when setting *plaus* to be a probability measure *P* or, as per the Cox theorem, when stating that "any measure of belief is isomorphic to a probability measure". This seems to suggest that probability is sufficient in dealing with uncertainty and that nothing else is needed (due to this isomorphism). The consequence is that one can now perform calculations (a calculus) with "degrees of belief" (subjective probabilities) and even mix probabilities based on subjective belief with probabilities based on frequencies. The question is therefore whether these subjective probabilities are the only legitimate way of calculating uncertainty. For one, probability requires that either the fact is there, or it is not there; nothing is left in the "middle". This then necessarily means that probability is ill-suited in cases where the excluded middle principle of logic does not apply. What are those cases?

#### **Intuitionism**

Probability theory is truth driven. An event occurs or does not occur. The truth will be revealed. From a hard scientific, perhaps engineering approach this seems perfectly fine, but it is not. A key figure in this criticism is the Dutch mathematician and philosopher Jan Brouwer. Brouwer founded the mathematical philosophy of intuitionism countering the then-prevailing formalism, in particular of David Hilbert as well as of Bertrand Russell, claiming that mathematics can be reduced to logic; the epistemological value of mathematical constructs lies in the fundamental nature of this logic.

In simplistic terms perhaps, intuitionists do not accept the law of excluded middle in logic. Intuitionism reasons from the point that science (in particular mathematics) is the result of the mental construction performed by humans rather than principles founded in the actual objective reality. Mathematics is not "truth", rather it constitutes applications of internally consistent methods used to realize more complex mental constructs, regardless of their possible independent existence in an objective reality. Intuition should be seen in the context of logic as the ability to acquire knowledge without proof or without understanding how the knowledge was acquired.

Classical logic states that existence can be proven by refuting non-existence (the excluded middle principle). For the intuitionist, this is not valid; negation does not entail falseness (lack of existence), it entails that the statement is refuted (a counter example has been found). For an intuitionist a proposition *p* is stronger than a statement of not (not *p*). Existence is a mental construction, not proof of non-existence. One specific form and application of this kind of reasoning is fuzzy logic.

#### **Fuzzy Logic**

It is often argued that epistemic uncertainty (or knowledge) does not cover all uncertainty (or knowledge) relevant to science. One such particular form of uncertainty is "vagueness", which is borne out of the vagueness contained in language (note that other language-dependent uncertainties exist, such as "context-driven" uncertainty). This may seem rather trivial to someone in the hard sciences, but it should be acknowledged that most language constructs ("this is air", meaning 78% nitrogen, 21% oxygen, and less than 1% of argon, carbon dioxide, and other gases) are purely theoretical constructs, of which we may still have an incomplete understanding. The air that is outside is whatever that substance is; it does not need human constructs, unless humans use it for calculations, which are themselves constructs. Unfortunately, (possibly flawed) human constructs are all that we can rely on.

The binary statements "this is air" and "this is not air" are again theoretical human constructs. Setting that aside, most of the concepts of vagueness are used in cases with unclear borders. Science typically works with classification systems ("this is a deltaic deposit", "this is a fluvial deposit"), but such concepts are again man-made constructs. Nature does not decide to "be fluvial", it expresses itself through laws of physics, which are still not fully understood.

A neat example presents itself in the September 2016 edition of EOS: "What is magma?" Most would think this is a problem which has already been solved, but it isn't, mostly due to vagueness in language and the ensuing ambiguity and difference in interpretation by even experts. A new definition is offered by the authors: "*Magma*: naturally occurring, fully or partially molten rock material generated within a planetary body, consisting of melt with or without crystals and gas bubbles and containing a high enough proportion of melt to be capable of intrusion and extrusion."

Vague statements ("this may be a deltaic deposit") are difficult to capture with probabilities (it is not impossible, but quite tedious and contrived). A problem occurs in setting demarcations. For example, in air pollution, one measures air quality using various indicators such as PM2.5, meaning particles which pass through a size-selective inlet with a 50% efficiency cut-off at 2.5 μm aerodynamic diameter. Then standards are set, using a cut-off to determine what is "healthy" (a green color), what is "not so healthy" (an orange color) and what is "unhealthy" (a red color) (the humorous reader may also think of terrorist alert levels). Hence, if the particulate matter changes by one single particle, the air suddenly goes from "healthy" to "not so healthy"?

In several questions of UQ, both epistemic and vagueness-based uncertainty may occur. Often vagueness uncertainty exists at a higher-level description of the system, while epistemic uncertainty may then deal with questions of estimation because of limited data within the system. For example, policy makers in the environmental sciences may set goals that are vague, such as "should not exceed critical levels". Such a vague statement then needs to be passed down to the scientist who is required to quantify risk of attaining such levels by means of data and numerical models, where epistemic uncertainty comes into play. In that sense there is no need to be rigorously accurate, for example according to a very specific threshold, given the above argument about such thresholds and classification systems.

Does probability easily apply to vagueness statements? Consider a proposition "the air is borderline unhealthy". The rule of the excluded middle no longer applies because we cannot say that the air is either not unhealthy or unhealthy. Probabilities no longer sum to one. It has therefore been argued that the propositional logic of probability theory needs to be replaced with another logic: fuzzy logic (although other logics have been proposed such as intuitionistic, trivalent logic, we will limit the discussion to this one alternative).

Fuzzy logic relies on fuzzy set theory (Zadeh 1965, 1975, 2004). A fuzzy set *A*, such as "deltaic", is said to be characterized by a membership function *μ*<sub>deltaic</sub>(*u*) representing the degree of membership given some information *u* on the deposit under study, for example *μ*<sub>deltaic</sub>(*deposit*) = 0.8 for a deposit with info *u* under study. Probabilists often claim that such a membership function is nothing more than a conditional probability *P*(*A*|*u*) in disguise (Loginov 1966). The link is made using the following mental construction. Imagine 1000 geologists looking at the same limited info *u* and then voting whether the deposit is "deltaic" or "fluvial". Let's assume these are the two options available. *μ*<sub>deltaic</sub>(*deposit*) = 0.832 means that 832 geologists picked "deltaic" and hence a vote picked at random has an 83.2% chance of being deltaic. However, the conditional probability comes with its limitations as it attempts to cast a very precise answer into what is still a very vague concept. What really is "deltaic"? Deltaic is simply a classification made by humans to describe a certain type of depositional system subject to certain geological processes acting on it. The result is a subsurface configuration, termed the architecture of clastic sediments. In modeling subsurface systems, geologists do not observe the processes (the deltaic system) but only the record of it. However, there is still no full agreement as to what is "deltaic" or where "deltaic" ends and "fluvial" starts as we go more upstream (recall our discussion on "magma"). What are the processes that are actually happening and how does all this get turned into a subsurface system? Additionally, geologists may not have a consensus on what "deltaic" is or where "fluvial" starts, or may classify based on personal experiences, different education (schools of thought about "deltaic"), and different education levels. What then does 0.832 really mean? What is the meaning of the difference between 0.832 and 0.831? Is this due to education? Misunderstanding or disagreement on the classification? Lack of data provided? It clearly should be a mix of all this, but probability does not allow an easy discrimination. We find ourselves again with a Duhem–Quine problem.
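The contrast between a sharp cut-off and a graded membership function, raised earlier for air quality, can be sketched as follows. The cut-off of 35 μg/m³ and the ramp between 25 and 45 μg/m³ are hypothetical numbers chosen only for illustration, not regulatory values. The sketch uses the standard fuzzy set operators (complement 1 − *μ*, union as max, intersection as min) and suggests why the excluded middle fails for a borderline case: "unhealthy or not unhealthy" has a degree below 1, and "unhealthy and not unhealthy" a degree above 0.

```python
def crisp_unhealthy(pm25: float, cutoff: float = 35.0) -> bool:
    """Binary classification with a sharp (hypothetical) cut-off in ug/m3."""
    return pm25 > cutoff

def mu_unhealthy(pm25: float, low: float = 25.0, high: float = 45.0) -> float:
    """Piecewise-linear membership in the fuzzy set 'unhealthy' (hypothetical ramp)."""
    if pm25 <= low:
        return 0.0
    if pm25 >= high:
        return 1.0
    return (pm25 - low) / (high - low)

results = {}
for pm25 in (34.9, 35.1):
    mu = mu_unhealthy(pm25)
    mu_not = 1.0 - mu                  # standard fuzzy complement
    results[pm25] = {
        "crisp": crisp_unhealthy(pm25),
        "mu": mu,
        "A or not A": max(mu, mu_not),   # fuzzy union
        "A and not A": min(mu, mu_not),  # fuzzy intersection
    }
    print(pm25, results[pm25])
```

A change of 0.2 μg/m³ flips the crisp label but moves the membership value only slightly, which is the behavior the demarcation argument asks for.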

Fuzzy logic does not take the binary route of voting up or down, but allows a grading in the vote of each member, meaning that it allows for more gradual transition between the two classes for each vote. Each person takes the evidence at his/her value and makes a judgement based on their confidence and education level: I don't really know, hence 50/50; I am pretty certain, hence 90/10. (More advanced readers in probability theory may now see a mixture of the models of probability stated based on the evidence of what the *u* is. However, because of the overlapping nature of how evidence is regarded by each voter, these prior probabilities are no longer uniform).
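The difference between the binary vote and the graded vote can be illustrated with a toy panel. The ten confidence values below are invented for illustration, and aggregating graded votes by an arithmetic mean is an assumption of this sketch, only one of several possible choices and not something prescribed by the text.

```python
# Hypothetical graded judgements of ten geologists: each number is that
# person's degree of confidence that the deposit is "deltaic" (0 = "fluvial").
graded_votes = [0.9, 0.9, 0.8, 0.7, 0.6, 0.6, 0.55, 0.5, 0.4, 0.2]

# Binary route: everyone is forced to pick "deltaic" (1) or "fluvial" (0).
# The undecided 50/50 voter is arbitrarily counted as "fluvial" here.
binary_votes = [1 if g > 0.5 else 0 for g in graded_votes]
fraction_deltaic = sum(binary_votes) / len(binary_votes)

# Graded route: keep each degree of confidence and aggregate (here, a mean).
mean_grade = sum(graded_votes) / len(graded_votes)

print(f"fraction voting deltaic: {fraction_deltaic:.3f}")
print(f"mean graded membership:  {mean_grade:.3f}")
```

The binary tally discards the distinction between a voter who is almost certain and one who is barely above 50/50, while the graded version retains it; how the undecided voter is counted is itself a modeling choice.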

#### **The Dogma of Precision**

Clearly probability theory (randomness) does not work well when the event itself is not clearly defined and is subject to discussion. Probability theory does not support the concept of a fuzzy event, hence such information (however vague and incomplete) becomes difficult and non-intuitive to account for. Probability theory does not provide a system for computing with fuzzy probabilities expressed as likely, unlikely and not very likely. Subjective probability theory relies on the elicitation rather than the estimation of a fuzzy system. It cannot address questions of the nature "What is the probability that the depositional system *may* be deltaic". One should question, under all this vagueness and ambiguity, what the meaning of the digit "2" or "3" in *P*(*A*|*u*) = 0.832 really is. The typical reply of probabilists to possibilists is to "just be more precise" and the problem is solved. But this would ignore a particular form of lack of understanding, which goes to the very nature of UQ. Precision is required that does not agree with the realism of vagueness on concepts, which are as yet imprecise (such as in subsurface systems).

The advantage and the disadvantage of the application of probability to UQ are that, dogmatically, it requires precision. It is an advantage in the sense that it attempts to render subjectivity into quantification, that the rules are very well understood and the methods deeply practiced, and that, because of the rigor of the theory, the community (of 300 years of practice) is vast. But this rigor does not always jibe with reality. Reality is more complex than "Navier Stokes" or "Deltaic", so we apply rigor to concepts (or even models) that probably deviate considerably from the actual processes occurring in nature. Probabilists often call this "structural" error (yet another classification and often ambiguous concept, because it has many different interpretations) but provide no means of determining what exactly this is and how it should be precisely estimated, as is required by their theories. It is left as a "research question", but can this question be truly answered within probability theory itself? For the same reasons, probabilistic methods (in particular Bayesian methods, see the following sections) are computationally very demanding, exactly because of this dogmatic quest for precision.

### **Possibility Theory: Alternative or Complement?**

Possibility theory has been popularized by Zadeh (1978) and by Dubois and Prade (1990). The original notion goes back further, to the economist Shackle (1962), who studied uncertainty based on degrees of potential surprise of events. Shackle also introduced the notion of conditional possibility (as opposed to conditional probability). Just like probability theory, possibility theory has axioms. Consider Ω to be a finite set, with subsets *A* and *B* that are not necessarily disjoint:

axiom 1: *pos*(∅) = 0 (Ω is exhaustive)

axiom 2: *pos*(Ω) = 1 (no contradiction)

axiom 3: *pos*(*A* ∪ *B*) = max(*pos*(*A*), *pos*(*B*)) ("additivity")

A noticeable difference with probability theory is that addition is replaced with "max" and the subsets in axiom 3 need not be disjoint. Additionally, probability theory uses a single measure, the probability, whereas possibility theory uses two concepts, the possibility and the necessity of the event. The necessity, a second measure, is defined as:

$$\operatorname{nec}(A) = 1 - \operatorname{pos}(\bar{A})\tag{27.2}$$

If the complement of an event is impossible, then the event is necessary. *nec*(*A*) = 0 means that *A* is unnecessary: one should not be "surprised" if *A* does not occur, and it says nothing about *pos*(*A*). *nec*(*A*) = 1 means that *A* is certainly true, which implies *pos*(*A*) = 1. Hence *nec* carries a degree of surprise: with *nec*(*A*) = 0.1 one is a little bit surprised, and with *nec*(*A*) = 0.9 very surprised, if *A* is not true. Possibility also allows for indeterminacy (which is not allowed in epistemic uncertainty); this is captured by *nec*(*A*) = 0, *pos*(*A*) = 1.

Logically then

$$\operatorname{nec}(A \cap B) = \min(\operatorname{nec}(A), \operatorname{nec}(B))\tag{27.3}$$

Possibility does not follow the rule of the excluded middle because

$$\operatorname{pos}(A) + \operatorname{pos}(\bar{A}) \ge 1\tag{27.4}$$
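The axioms and Eqs. (27.2) to (27.4) can be checked numerically on a small finite set. The following is a minimal sketch, assuming a hypothetical possibility distribution over three depositional systems (the outcome names, the values and the helper names `pos`, `nec` and `check_rules` are illustrative and not part of the original text). The possibility of an event is taken as the largest possibility among its outcomes, which is the usual construction on a finite set.

```python
from itertools import combinations

# Hypothetical possibility distribution over depositional systems.
# The values are illustrative only; at least one outcome must have possibility 1.
PI = {"deltaic": 1.0, "fluvial": 0.7, "turbidite": 0.2}
OMEGA = frozenset(PI)


def pos(event):
    """Possibility of an event: largest possibility among its outcomes (0 if empty)."""
    return max((PI[w] for w in event), default=0.0)


def nec(event):
    """Necessity of an event, Eq. (27.2): 1 - pos(complement)."""
    return 1.0 - pos(OMEGA - frozenset(event))


def all_events():
    """Every subset of OMEGA, including the empty set and OMEGA itself."""
    outcomes = sorted(OMEGA)
    return [frozenset(c) for r in range(len(outcomes) + 1)
            for c in combinations(outcomes, r)]


def check_rules():
    """Verify the three axioms, Eq. (27.3) and Eq. (27.4) on all events."""
    events = all_events()
    ok = pos(frozenset()) == 0.0 and pos(OMEGA) == 1.0
    for a in events:
        ok = ok and pos(a) + pos(OMEGA - a) >= 1.0          # Eq. (27.4)
        for b in events:
            ok = ok and pos(a | b) == max(pos(a), pos(b))   # axiom 3
            ok = ok and nec(a & b) == min(nec(a), nec(b))   # Eq. (27.3)
    return ok


A = frozenset({"deltaic", "fluvial"})
print("pos(A) =", pos(A), " nec(A) =", nec(A))
print("all rules hold:", check_rules())
```

In this toy setting the event "fluvial" alone has a possibility of 0.7 but a necessity of 0, which illustrates that an event can be quite possible without being necessary at all.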

Take the following example. Consider a reservoir. It either contains oil (*A*) or contains no oil (*Ā*) (something we would like to know!). *pos*(*A*) = 0.5 means that I am willing to bet that the reservoir contains oil so long as the odds are even *or better*. I would not bet that it contains oil. Hence this describes a degree of belief very different from subjective probabilities.

Possibilities are sometimes called "imprecise probabilities" (Hand and Walley 1993) or are interpreted that way. "Imprecise" need not be negative; as discussed above, it has its own advantages, in particular in terms of computation. In probability theory, information is used to update degrees of belief. This is based on Bayes' rule, whose philosophy will be studied more closely in the next section. A counterpart to Bayes' rule exists in possibility theory, but because of the imprecision of possibilities relative to probabilities, no unique way exists to update possibilities into a new possibility, given new (vague) information. Recall that Bayes' rule relies on the product (corresponding to a conjunction in classical logic)

$$P(A|B) = \frac{P(B|A)}{P(B)}P(A) \tag{27.5}$$

Consider first the counterpart of the probability density function *f<sub>X</sub>*(*x*) in possibility theory, namely the possibility distribution *π<sub>X</sub>*(*x*). Unlike probability densities, which can be inferred from data, possibility distributions are always specified by users, and hence take simple functional forms (constant, triangular). Densities express likelihoods: the ratio of the densities at two outcomes denotes how much more (or less) likely one outcome is than the other. A possibility distribution simply states how possible an outcome *x* is. Hence a possibility distribution is always less than or equal to unity (not the case for a density). Also, note that *P*(*X* = *x*) = 0 always if *X* is a continuous variable, while *pos*(*X* = *x*) is not zero everywhere. As in the case of a joint probability distribution, we can define a joint possibility distribution *π<sub>X,Y</sub>*(*x*, *y*) and conditional possibility distributions *π<sub>X|Y</sub>*(*x*|*y*). The objective now is to infer *π<sub>X|Y</sub>*(*x*|*y*) from *π<sub>Y|X</sub>*(*y*|*x*) and *π<sub>X</sub>*(*x*).

As mentioned above, probability theory relies on a logical conjunction, see Fig. 27.6. This conjunction has the following properties:

$$\begin{aligned} a \cap b &= b \cap a \quad (\text{commutativity})\\ \text{if } a \le a' \text{ and } b \le b' \text{ then } a \cap b &\le a' \cap b' \quad (\text{monotonicity})\\ (a \cap b) \cap c &= a \cap (b \cap c) \quad (\text{associativity})\\ a \cap 1 &= a \quad (\text{neutrality}) \end{aligned}$$

Possibility theory, as it is based on fuzzy sets rather than random sets, relies on an extension of the conjunction operation.

$$T_1(a,b) = \min(a,b);\quad T_2(a,b) = a \cdot b;\quad T_3(a,b) = \frac{ab}{a+b-ab}$$

**Fig. 27.6** Example of t-norms for conjunction operations

This new conjunction is termed a triangular norm (T-norm) (Jenei and Fodor 1998; Höhle 2003; Klement et al. 2004) because it satisfies the following four properties:

$$\begin{aligned} T(a,b) &= T(b,a) \quad \text{(commutativity)}\\ \text{if } a \le a' \text{ and } b \le b' \text{ then } T(a,b) &\le T(a',b') \quad \text{(monotonicity)}\\ T(a,T(b,c)) &= T(T(a,b),c) \quad \text{(associativity)}\\ T(a,1) &= a \quad \text{(neutrality)} \end{aligned}$$
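The four properties can be verified numerically for the three t-norms of Fig. 27.6. The sketch below is an illustration only: it checks the properties on a coarse grid of values in [0, 1] rather than proving them, and the function names are hypothetical. The arithmetic mean is included as an example of an operation that is not a t-norm (it fails neutrality).

```python
def t_min(a, b):
    """Minimum t-norm."""
    return min(a, b)


def t_prod(a, b):
    """Product t-norm."""
    return a * b


def t_hamacher(a, b):
    """ab / (a + b - ab), with the conventional value 0 when a = b = 0."""
    denom = a + b - a * b
    return 0.0 if denom == 0.0 else a * b / denom


def arithmetic_mean(a, b):
    """Not a t-norm: used here only as a counterexample."""
    return 0.5 * (a + b)


GRID = [i / 10 for i in range(11)]   # 0.0, 0.1, ..., 1.0
TOL = 1e-12                          # tolerance for floating-point round-off


def satisfies_t_norm_properties(T):
    """Check commutativity, monotonicity, associativity and neutrality on GRID."""
    for a in GRID:
        if abs(T(a, 1.0) - a) > TOL:                         # neutrality
            return False
        for b in GRID:
            if abs(T(a, b) - T(b, a)) > TOL:                 # commutativity
                return False
            for c in GRID:
                if abs(T(a, T(b, c)) - T(T(a, b), c)) > TOL:  # associativity
                    return False
                # Monotonicity in the second argument; together with
                # commutativity this gives monotonicity in both arguments.
                if b <= c and T(a, b) > T(a, c) + TOL:
                    return False
    return True


for name, T in [("min", t_min), ("product", t_prod),
                ("Hamacher", t_hamacher), ("mean", arithmetic_mean)]:
    print(name, satisfies_t_norm_properties(T))
```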

Recall that Cox relied on the postulate that *plaus*(*p* ∩ *q*) = *f*(*plaus*(*p*), *plaus*(*q*|*p*)). Similarly, possibility theory relies on:

$$
\pi_{X,Y}(x, y) = T\left(\pi_X(x), \pi_{Y|X}(y|x)\right) = T\left(\pi_Y(y), \pi_{X|Y}(x|y)\right) \tag{27.6}
$$

For example, for the minimum triangular norm we get

$$\pi_{X|Y}(x|y) = \begin{cases} 1 & \text{if } \pi_Y(y) = \min\left\{\pi_X(x), \pi_{Y|X}(y|x)\right\} \\ \min\left\{\pi_X(x), \pi_{Y|X}(y|x)\right\} & \text{if } \pi_Y(y) > \min\left\{\pi_X(x), \pi_{Y|X}(y|x)\right\} \end{cases} \tag{27.7}$$

and for the product triangular norm, we get something that looks Bayesian

$$
\pi_{X|Y}(x|y) = \frac{\pi_{Y|X}(y|x)\,\pi_X(x)}{\pi_Y(y)} \tag{27.8}
$$
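The two updating rules can be compared on a small discrete example. The following is a minimal sketch, assuming the common convention that the joint possibility is the t-norm of *π<sub>X</sub>* and *π<sub>Y|X</sub>* and that the marginal of the observation is the maximum of the joint over *x*. The reservoir and seismic labels, the numerical values and the function name `condition_on_y` are hypothetical and serve only as illustration.

```python
def condition_on_y(pi_x, pi_y_given_x, y, t_norm="min"):
    """Return pi_{X|Y}(. | y) from pi_X and pi_{Y|X} for the chosen t-norm."""
    if t_norm not in ("min", "product"):
        raise ValueError("t_norm must be 'min' or 'product'")
    combine = min if t_norm == "min" else (lambda a, b: a * b)
    # Joint possibility of (x, y) for the observed y.
    joint = {x: combine(pi_x[x], pi_y_given_x[x][y]) for x in pi_x}
    # Marginal possibility of the observation: max over x (assumed convention).
    pi_y = max(joint.values())
    if pi_y == 0.0:
        raise ValueError("observation y is impossible under the model")
    if t_norm == "min":
        # Minimum rule: outcomes attaining the marginal become fully possible.
        return {x: 1.0 if j == pi_y else j for x, j in joint.items()}
    # Product rule: renormalize by the marginal, which looks Bayesian.
    return {x: j / pi_y for x, j in joint.items()}


# Hypothetical prior possibility of the reservoir state.
pi_x = {"oil": 1.0, "no_oil": 0.6}
# Hypothetical conditional possibility of a seismic response given the state.
pi_y_given_x = {
    "oil": {"bright": 1.0, "dim": 0.3},
    "no_oil": {"bright": 0.4, "dim": 1.0},
}

for rule in ("min", "product"):
    for y in ("bright", "dim"):
        print(rule, y, condition_on_y(pi_x, pi_y_given_x, y, rule))
```

Both rules return a distribution whose maximum is 1, but they differ in how strongly the less possible outcome is discounted, which illustrates that the update is not unique.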

### **27.8 Bayesianism**

#### **Thomas Bayes**

Uncertainty quantification today often has a Bayesian flavor. What does this mean? Most researchers simply invoke Bayes' rule as a theorem within probability theory. They work within the paradigm. But what is really the paradigm of Bayesianism? It can be seen as a simple set of methodologies, but it can also be regarded as a philosophical approach to doing science, in the same sense as empiricism, positivism, falsificationism or inductionism. The Reverend Bayes would perhaps be somewhat surprised by the scientific revolution and mainstream acceptance of the philosophy based on his rule.

Thomas Bayes was a statistician, philosopher and Reverend. Bayes presented a solution to the problem of inverse probability in "An Essay towards Solving a Problem in the Doctrine of Chances". This essay was read to the Royal Society of London by Richard Price, a year after Bayes' death. Bayes' theorem remained in the background until the essay was reprinted in 1958, and even then it took a few more decades before an entirely new approach to scientific reasoning, Bayesianism, was created (Howson et al. 1993; Earman 1992).

Prior to Bayes, most work on chance was focused on direct inference, such as the number of replications needed to calculate a desired level of probability (how many flips of the coin are needed to assure 50/50 chance?). Bayes treated the problem of inverse probability: "given the number of times an unknown event has happened and failed: required the chance that the probability of its happening in a single chance lies between any two degrees of probability that can be named" (see the Biometrika publication of Bayes' essay). Bayes' essay has essentially four parts. Part 1 consists of a definition of probability and some basic calculations, which are now known as the axioms of probability. Part 2 uses these calculations in a chance event related to a perfectly leveled billiard table, see Fig. 27.7. Part 3 applies the equations obtained from the analysis of the billiard problem to his problem of inverse probability. Part 4 consists of more numerical studies and applications.

Bayes, in his essay, was not concerned with induction and the role of probability in it. Price, however, in the preface to the essay did express a wish that the work would in fact lead to a more rational approach to induction than was then currently available. What is perhaps less known is that "Bayes' theorem", in the form that we now know it, was never written by Bayes. However, it does occur in the solution to his particular problem. As mentioned above, Bayes was interested in a chance event with unknown probability (such as in the billiard table problem), given a number of trials.

**Fig. 27.7** Bayes' billiard table: "to be so made and leveled that if either of the balls O or W be thrown upon it, there shall be the same probability that it rests upon any one equal part of the plane as another" (Bayes and Price 1763)

If *M* counts the number of times that an event occurs in *n* trials, then the solution is given through the binomial distribution

$$P(p_1 \le p \le p_2 | M = m) = \frac{\int_{p_1}^{p_2} \binom{n}{m} p^m (1-p)^{n-m} P(dp)}{\int_0^1 \binom{n}{m} p^m (1-p)^{n-m} P(dp)} \tag{27.9}$$

where *P*(*dp*) is the prior distribution over *p*. Bayes' insight here is to "suppose the chance is the same that it (*p*) should lie between any two equi-different degrees", that is, *P*(*dp*) = *dp*. In other words, the prior is uniform, leading to

$$P(p_1 \le p \le p_2 | M = m) = \frac{(n+1)!}{m!(n-m)!} \int_{p_1}^{p_2} p^m (1-p)^{n-m} dp \tag{27.10}$$

Why uniform? Bayes does not reason from the current principle of indifference (which can be debated, see later), but rather from an operational characterization of an event concerning the probability of which we know absolutely nothing prior to the trials. The use of prior distributions, however, was one of the key insights of Bayes that very much lives on.
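The effect of the uniform prior can be checked numerically. With *P*(*dp*) = *dp*, the denominator of Eq. (27.9) equals 1/(*n* + 1) for every *m*, so the posterior probability of an interval is the factor (*n* + 1)!/(*m*!(*n* − *m*)!) multiplied by the integral of *p<sup>m</sup>*(1 − *p*)<sup>*n*−*m*</sup> over that interval. The sketch below is an illustration under that assumption; the midpoint-rule integrator and the function names are hypothetical helpers, not part of Bayes' essay.

```python
from math import comb, factorial


def integrate(f, lo, hi, steps=2000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


def posterior_ratio_form(p1, p2, m, n):
    """Ratio of binomial integrals, Eq. (27.9), with a uniform prior P(dp) = dp."""
    def likelihood(p):
        return comb(n, m) * p ** m * (1 - p) ** (n - m)
    return integrate(likelihood, p1, p2) / integrate(likelihood, 0.0, 1.0)


def posterior_factorial_form(p1, p2, m, n):
    """Factorial prefactor times the integral of p^m (1-p)^(n-m) over [p1, p2]."""
    prefactor = factorial(n + 1) / (factorial(m) * factorial(n - m))
    return prefactor * integrate(lambda p: p ** m * (1 - p) ** (n - m), p1, p2)


# Illustrative numbers: the event occurred m = 7 times in n = 10 trials.
m, n = 7, 10
print(posterior_ratio_form(0.5, 0.9, m, n))
print(posterior_factorial_form(0.5, 0.9, m, n))
```

The two printed values agree up to the accuracy of the numerical integration, which is one way to confirm that the normalizing integral is indeed 1/(*n* + 1).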

#### **Rationality for Bayesianism**

Bayesians can be regarded more as relativists than absolutists (such as Popper). They believe in prediction based on imperfect theories. For example, they will take an umbrella on their weekend trip if their ensemble Kalman filter prediction of the weather at the trip location puts a high (posterior) probability on rain in 3 days. Even if the laws involved are imperfect and probably can be falsified (many weather predictions are completely wrong!), they rely on continued learning from future information and adjustments. Instead of relying on Popper's zero probability (rejected or not), they rely more on an inductive inference yielding non-zero probabilities.

If we now take the general scientific perspective (and not the limited topic of UQ), then Bayesians see science progress through hypotheses, theories and evidence offered towards these hypotheses, all quantified using probabilities. In this general scientific context, we may therefore state a hypothesis *H* and gather evidence *E*, with *P*(*H*|*E*) the probability of the hypothesis in the light of the evidence, *P*(*E*|*H*) the probability that the evidence occurs when the hypothesis is true, *P*(*H*) the probability of the hypothesis without any evidence and *P*(*E*) the probability of the evidence, without stating any hypothesis being true.

$$P(H|E) = \frac{P(E|H)}{P(E)}P(H) \tag{27.11}$$

*P*(*H*) is also termed the prior probability and *P*(*H*|*E*) the posterior probability. We provided some discussion on a logical way of explaining this theorem (Cox 1946) and the subsequent studies that showed this was not quite as logical as it seems (Halpern 1995, 2011). Few people today know that Bayesian probability has 6 axioms (Dupré and Tipler 2009). Despite these perhaps rather technical difficulties, a simple logic underlies this rule. Bayes' theorem states that the extent to which some evidence supports a hypothesis is proportional to the degree to which the evidence is predicted by the hypothesis. If the evidence is very likely ("Sandstone has lower acoustic impedance than shale"), then the hypothesis ("Acoustic impedance depends on mineral composition") is not supported significantly when indeed we measure that "Sandstone has lower acoustic impedance than shale". If, however, the evidence is deemed very unlikely (e.g. "Shale has higher acoustic impedance than sandstone"), then the hypothesis of another theory ("acoustic impedance depends not only on mineralization, but also on fluid content") will be highly confirmed (have high posterior probability).

Another interesting concept is how Bayes' rule deals with multiple pieces of evidence that have the same impact on the hypothesis. Clearly, more evidence leads to an increase in the probability of a hypothesis supported by that evidence. But pieces of evidence of the same impact will have a diminishing effect. Consider that a hypothesis has the same probability as some alternative hypothesis:

$$P(H) = 0.5$$

Now consider multiple evidence sources such that

$$P(H|E_1) = 0.8; \quad P(H|E_2) = 0.8; \quad P(H|E_3) = 0.8$$

Then according to a model of conditional independence and Bayes' theorem (Bordley 1982; Journel 2002; Clemen and Winkler 2007):

$$P(H|E_2, E_1) = 0.94; \quad P(H|E_3, E_2, E_1) = 0.98$$

Compounding evidence leads to increasing probability of the hypothesis.
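These numbers can be reproduced with a few lines of code. The sketch below assumes one common reading of the conditional independence model: the pieces of evidence are independent given *H* and given its negation, so that each piece multiplies the prior odds by the same factor it would contribute on its own. The function name `combine_evidence` is hypothetical, and the cited works discuss more general combination models.

```python
def combine_evidence(prior, single_posteriors):
    """Posterior P(H | E_1, ..., E_k) from the individual posteriors P(H | E_i).

    Assumes the E_i are conditionally independent given H and given not-H,
    so that posterior odds = prior odds * product of individual odds ratios.
    """
    prior_odds = prior / (1.0 - prior)
    odds = prior_odds
    for p in single_posteriors:
        odds *= (p / (1.0 - p)) / prior_odds   # odds ratio contributed by E_i
    return odds / (1.0 + odds)


prior = 0.5
for k in range(1, 4):
    posterior = combine_evidence(prior, [0.8] * k)
    print(k, "piece(s) of evidence:", round(posterior, 2))
```

The successive gains (from 0.5 to 0.8, then to about 0.94 and 0.98) shrink with every additional piece of evidence, which is the diminishing effect described above.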

### **Objective Versus Subjective Probabilities**

In the early days of the development of Bayesian approaches, several general principles were stated under which researchers "should" operate, resulting in an "objective" approach to the problem of inference, in the sense that everyone follows the same logic. One such principle is the principle of maximum entropy (Jaynes 1957), of which the principle of indifference (Laplace) is a special case. Subjectivists do not see probabilities as objective (leading to prescribing zero probabilities to well-confirmed ideas). Rather, subjectivists (Howson et al. 1993) see Bayes' theorem as an objective theory of inference: objective in the sense that, *given* prior probabilities and evidence, posterior probabilities are calculated. In that sense, subjective Bayesians make no claim on the nature of the propositions on which inference is being made (in that sense, they are also deductive).

One interesting application of reasoning in this way results when disagreement occurs on the same model. Consider modeler A (the conformist), who assigns a high probability to some relatively well-accepted modeling hypothesis and low probability to some rare (unexpected) evidence. Consider modeler B (the skeptic), who assigns low probability to the norm and hence high probability to any unexpected evidence. Consequently, when the unexpected evidence occurs and hence is confirmed, *P*(*E*|*H*) = 1, the posterior of each is proportional to 1/*P*(*E*). Modeler A is forced to increase their prior more than Modeler B. Some Bayesians therefore state that the prior is not that important as continued new evidence is offered. The prior will be "washed out" by cumulating new evidence. This is only true for certain highly idealized situations. It is more likely that two modelers will offer two hypotheses, hence evidence needs to be evaluated against each other. However, there is always a risk that neither model can be confirmed, regardless of how much evidence is offered, hence the prior model space is incomplete, which is the exact problem of the objectivist Bayes. Neither objective nor subjective Bayes addresses this problem.

### **Bayes with Ad Hoc Modifications**

Returning now to the example of Fig. 27.5: Bayesian theory, if properly applied, allows for assessing these ad hoc model modifications. Consider that a certain modeling assumption *H* is prevailing in multi-phase flow: "oil flow occurs in rock with permeability of 10-10000 md" (*H*). Now this modeling assumption is modified ad hoc to "oil flow occurs in rock with permeability of 10-10000 md and 100-200 D" (*H* ∩ *AdHoc*). However, this ad hoc modification, under *H*, has very low probability, *P*(*AdHoc*) ≃ 0, and hence *P*(*H* ∩ *AdHoc*) ≃ 0. The problem in reality is that those making the ad hoc modification often do not use Bayesianism, hence never assess or use the prior *P*(*AdHoc*).

### **Criticism of Bayesianism**

What is critical to Bayesianism is the concept of "background knowledge". Probabilities are calculated based on some commonly assumed background knowledge. Recall that theories cannot be isolated and independently tested. This "background" consists of all the available assumptions tangent to the hypothesis at hand. The problem that often results from using Eq. (27.11) is that such "background knowledge" *BK* is left implicit:

$$P_{BK_0}(H|E) \simeq P_{BK_0}(E|H)P_{BK_0}(H) \to P_{BK_1}(H) \tag{27.12}$$

where the subscript 0 indicates time *t* = 0. The posterior then includes the "new knowledge", which is included in the new background knowledge at the next stage *t* = 1. A problem occurs when applying this to the real world: what is this "background knowledge"? In reality, the prior and likelihood are not determined by the same person. For example, in our application, the prior may be given by a geologist, the likelihood by a data scientist. It is unlikely that they have the same "background knowledge" (or even agree on it). A more "honest" way of conveying this issue is to make the background knowledge explicit. Suppose that *BK*<sup>(1)</sup> is the background knowledge of person 1, who deals with evidence (the data scientist); then

$$P\left(H|E\cap BK^{(1)}\right) \simeq P\left(E\cap BK^{(1)}|H\right)P\left(H|BK^{(1)}\right) \tag{27.13}$$

Suppose *BK*<sup>(2)</sup> is the background knowledge of person 2 (the geologist), who provides the "prior", meaning background knowledge of his/her own, without evidence. Then, the new posterior can be written as

$$P\left(H|E\cap BK^{(1)}\cap BK^{(2)}\right) \simeq P\left(E\cap BK^{(2)}|H\right)P\left(H|BK^{(2)}\right)P\left(H|BK^{(1)}\right) \tag{27.14}$$

assuming, however, that there is no overlap between the two bodies of background knowledge. In practice, the issue that different components of the "system" (model) are built by different modelers with different background knowledge is ignored. Even if one were aware of this issue, it would be difficult to address in practice. The ideal Bayesian approach rarely occurs. No single person understands all the detailed aspects of the scientific modeling study at hand. A problem then occurs with dogmatism. The study in Fig. 27.5 illustrates this. Hypotheses that are given very high probability (no fractures) will remain high, particularly in the absence of strong evidence (low to medium *P*(*E*)). Bayes' rule will keep assigning very high probabilities to such hypotheses, particularly due to the dogmatic belief of the modeler or the prevailing leading idea of what is going on. This is not a problem of Bayes' rule, but of its common (faulty) application. Bayes' rule itself cannot address this.

More common is to select a prior hypothesis based on general principles or mathematical convenience, for example using a maximum entropy principle. Under such a principle, complete ignorance results in choosing a uniform distribution. In all other cases, one should pick the distribution that makes the least claims, from whatever information is currently available, on the hypothesis being studied. The problem here is not so much the ascribing of uniform probabilities but providing a statement of what all the possibilities are (to which uniform probabilities are then assigned). Who chooses these theories/models/hypotheses? Are those the only ones?
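The link between ignorance and the uniform distribution can be illustrated with the Shannon entropy of a discrete distribution. The sketch below is an illustration only: the three geological scenarios and their probabilities are hypothetical, and the point it makes is limited to the fact that, among distributions over a fixed list of hypotheses, the uniform one has the largest entropy. It says nothing about whether the list itself is complete, which is the concern raised above.

```python
from math import log


def shannon_entropy(probabilities):
    """Shannon entropy -sum(p log p) in nats; terms with p = 0 contribute 0."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * log(p) for p in probabilities if p > 0.0)


# Hypothetical priors over three geological scenarios
# (deltaic, fluvial, turbidite); the values are illustrative only.
candidates = {
    "uniform (complete ignorance)": [1 / 3, 1 / 3, 1 / 3],
    "mild preference": [0.5, 0.3, 0.2],
    "strong preference": [0.9, 0.05, 0.05],
    "dogmatic": [1.0, 0.0, 0.0],
}

for name, probs in candidates.items():
    print(f"{name}: entropy = {shannon_entropy(probs):.3f}")
```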

The limitation of Bayesianism, therefore, is that no judgment is passed on the stated prior probabilities. Hence, any Bayesian analysis is only as strong as the analysis of the prior. In subsurface modeling this prior is dominated by the geological understanding of the system. Such geological understanding and its background knowledge are vast, but qualitative. Later we will provide some ideas on how to make quantitative "geological priors".

### **Deductive Testing of Inductive Bayesianism**

The leading paradigm of Bayesianism is to subscribe to an inductive form of reasoning: learning from data. Increasing evidence will lead to increasing probabilities of certain theories, models or hypotheses. As discussed in the previous section, one of the main issues lies in the statement of a prior distribution, the initial universe of possibilities. Bayesianism assumes that a truth exists, that such truth is generated by a probability model, and also that any data/evidence are generated from this model. The main issue occurs when the truth is not even within the support (the range/span) generated by this (prior) probability model. The truth is not part of this initial universe. What happens then? The same goes when the error distribution on the data is chosen at too optimistic a level, in which case the truth may be rejected. Can we verify this? Diagnose this? Figure out whether the problem lies with the data or the model? Given the complexity of models, priors and data in the real world, this issue may in fact go undiagnosed if one stops the analysis with the generation of the posterior distribution. Gelman and Shalizi (2013) discuss how mis-specified prior models (the truth is not in the prior) may result in either no solution, multi-modal solutions to problems that are unimodal, or complete nonsense.

Recent work (Mayo 1996) started to look at these issues, attempting to frame such tests within classical hypothesis testing. Recall that classical statistics relies on a deductive form of hypothesis testing, which is very similar in flavor to Popper's falsification. In a similar vein, some form of model testing can be performed after the generation of the posterior. Note that Bayesian model averaging (Rings et al. 2012; Henriksen et al. 2012; Refsgaard et al. 2012; Tsai and Elshall 2013) or model selection are not tests of the posterior; rather, they are consequences of the posterior distribution, yet untested! Classical checks are whether posterior models match data, but these are checks based on likelihood (misfit) only.

Consider a more elaborate testing framework. These formal tests rely on generating replicates of the data, assuming that some model hypothesis and parameters are the truth. Take a simple example of a model hypothesis with two faults (*H* = two faults) and the parameters **θ** representing those faults (e.g. dip, azimuth, length, etc.). The bootstrap allows for a determination of the achieved significance level (*ASL*) as

$$ASL(\boldsymbol{\theta}) = P\left(S(\mathbf{d}_{rep}) \ge S(\mathbf{d}_{obs}) | H, \boldsymbol{\theta}\right) \tag{27.15}$$

Here, we consider calculating some summary statistic of the data as represented by the function *S*. This summary statistic could be based on some dimension reduction method, for example a first or second principal component score. The uncertainty on **θ** is provided by its posterior distribution, hence we can sample various **θ** from the posterior. Therefore, we first sample **d**<sub>*rep*</sub> from the following distribution (averaging out over the posterior in **θ**)

$$P\left(\mathbf{d}_{rep}|H,\mathbf{d}_{obs}\right) = \int P\left(\mathbf{d}_{rep}|H,\boldsymbol{\theta}\right)P(\boldsymbol{\theta}|H,\mathbf{d}_{obs})\,d\boldsymbol{\theta} \tag{27.16}$$

and calculate the average *ASL* over the posterior distribution. Analytically, this equals

$$ASL = \int ASL(\boldsymbol{\theta}) P(\boldsymbol{\theta}|H, \mathbf{d}_{obs})\,d\boldsymbol{\theta} \tag{27.17}$$

or, for a given limited sample **θ**<sup>(ℓ)</sup>, ℓ = 1, …, *L* ∼ *P*(**θ**|*H*, **d**<sub>*obs*</sub>),

$$ASL = \frac{1}{L} \sum_{\ell=1}^{L} ASL\left(\boldsymbol{\theta}^{(\ell)}\right) \tag{27.18}$$

These tests are not used to determine whether a model is true, or even whether it should be falsified, but whether discrepancies exist between model and data. The nature of the functions *S* defines the "severity" of the tests (Mayo 1996). Numerous complex functions will allow for a more severe testing of the prior modeling hypothesis. We can learn how the model fails by generating several of these summary statistics, each representing different elements of the data (a low, a middle and some extreme case, etc.).
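The averaging of the *ASL* over posterior samples can be sketched with a deliberately simple stand-in for a subsurface model. In the toy example below, which is an assumption made purely for illustration, the hypothesis *H* states that the data are independent Gaussian values with unknown mean **θ** and known standard deviation, and a flat prior yields a Gaussian posterior for **θ**. One replicate data set is drawn per posterior sample, so the fraction of replicates with *S*(**d**<sub>*rep*</sub>) ≥ *S*(**d**<sub>*obs*</sub>) is a Monte Carlo estimate of the average *ASL*. The function name `average_asl` and all numbers are hypothetical.

```python
import random
from statistics import mean


def average_asl(d_obs, summary, sigma=1.0, n_posterior=2000, seed=0):
    """Monte Carlo estimate of the posterior-averaged ASL for a toy Gaussian model."""
    rng = random.Random(seed)
    n = len(d_obs)
    # Posterior of theta under a flat prior and known sigma (toy assumption).
    post_mean, post_sd = mean(d_obs), sigma / n ** 0.5
    s_obs = summary(d_obs)
    hits = 0
    for _ in range(n_posterior):
        theta = rng.gauss(post_mean, post_sd)                 # theta ~ P(theta | H, d_obs)
        d_rep = [rng.gauss(theta, sigma) for _ in range(n)]   # d_rep ~ P(d_rep | H, theta)
        hits += summary(d_rep) >= s_obs                       # compare summary statistics
    return hits / n_posterior


# Illustrative data: one set compatible with the model, one with an extreme value.
well_behaved = [-1.5, -1.0, -0.5, -0.2, 0.0, 0.1, 0.3, 0.6, 1.0, 1.6]
with_outlier = well_behaved[:-1] + [8.0]

print("ASL (max), compatible data:", average_asl(well_behaved, max))
print("ASL (max), data with outlier:", average_asl(with_outlier, max))
```

An *ASL* close to 0 or 1 for a given summary statistic flags a discrepancy between the model and that aspect of the data; here the maximum exposes the extreme value, whereas a statistic such as the mean would not, because the posterior has been fitted to it.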

Within this framework of deductive tests, the prior is no longer treated as "absolute truth", rather the prior becomes a modeling assumption that is "testable" given the data. Some may however disagree on this point: why should the data be any better than the prior? In the next section, we will try to get out of this trap, by basing priors on physical processes, with the hope that such priors are more realistic representations of the universe of variability, rather than simply relying on statistical methods that are devoid of physics.

### **27.9 Bayesianism for Subsurface Systems**

### **What is the Nature of Geological Priors?**

#### **Constructing Priors from Geological Field Work**

In a typical subsurface system, the model variables are parameterized in a certain way, for example with a grid, or a set of objects with certain lengths, widths, dips, azimuths, etc. What is the prior distribution of these model variables? Since we are dealing with a geological system, e.g. a delta, a fluvial or a turbidite system, a common approach is to do geological field work. This entails measuring and interpreting the observed geological structures on outcrops and creating a history of their genesis, with an emphasis on generating an (often qualitative) understanding of the processes that generated the system. The geological literature contains a vast number of such studies.

To gather all this information and render it relevant for modeling UQ, geological databases based on classification systems have been compiled (mostly by the oil industry). Analog databases, for example, on proportions, paleo-direction, morphologies and architecture of geological bodies or geological rules of association (Eschard and Doligez 2000; Gibling 2006) for various geological environments (FAKTS: Colombera et al. 2012; CarbDB: Jung and Aigner 2012; WODAD: Kenter and Harris 2006; Paleoreefs: Kiessling and Flügel 2002; Pyrcz et al. 2008) have been constructed. Such relational databases employ a classification system based on geological reasoning. For example, the FAKTS database classifies existing studies, whether literature-derived or field-derived from modern or ancient river systems, according to controlling factors, such as climate, and context-descriptive characteristics, such as river patterns. The database can therefore be queried on both architectural features and boundary conditions to provide the analogs for modeling subsurface systems. The nature of the classification is often hierarchical: the uncertain style or classification, often termed "geological scenario" (Martinius and Naess 2005), and variations within that style.

While such an approach appears to gather information, it leaves open the question of whether the collection of such information and the extraction of parameter values to state prior distributions produce realistic priors (enough variance, limited bias) for what is actually in the subsurface. Why?


The main limitation is that this purely parameterization-based view (the geometries, dimensions) lacks physical reasoning and hence ignores important prior information. The next section provides some insight into this problem and suggests a solution.

#### **Constructing Priors from Laboratory Experiments**

Depositional systems are subject to large variability whose very nature is not fully understood. For example, channelized transport systems (fans, rivers, deltas, etc.) reconfigure themselves more or less continually in time, and in a manner often difficult to predict. The configurations of natural deposits in the subsurface are thus uncertain. The quest for quantifying prior uncertainty necessitates understanding the sedimentary systems by means of physical principles, not just information principles (such as the principle of indifference). Quantifying prior uncertainty thus requires stating all configurations of architectures of the system deemed *physically* possible and at what frequency (a probability density) they occur. This probability density need not be Gaussian or uniform. Hence, the question arises: what is this probability density for geological systems, and how does one represent it in a form that can be used for actual predictions using Bayesianism?

The problem in reality is that we observe geological processes over a very short time span (50 years of satellite data and ground observations), while the deposition of the relevant geological systems we work with in this chapter may span 100,000 years or more. For that reason, the only way to study such systems is either by computer models or by laboratory experiments. These computer models solve a set of partial differential equations that describe sediment transport, compaction, diagenesis, erosion, dissolution, etc. (Koltermann and Gorelick 1992; Gabrovsek and Dreybrodt 2010; Nicholas et al. 2013). The main issue here is that PDEs are a limited representation of the actual physical process: they require calibration with actual geological observations (such as erosion rules), boundary conditions and source terms. Often their long computing times limit their usefulness for constructing complete priors.

For that reason, laboratory experiments are increasingly used to study geological deposition, simply because physics occurs naturally, and not as constructed with an artificial computer code. Next, we provide some insight into how laboratory experiments work and how they can be used to create realistic analogs of depositional systems.

#### **Experimenting the Prior**

We consider a delta constructed in an experimental sedimentary basin subject to constant external boundary conditions (i.e. sediment flux, water discharge, subsidence rates), see Fig. 27.8. The data set used is a subset of the data collected during an experiment in the Tulane Delta Basin, conducted in 2010 (Wang et al. 2011). Basin dimensions were 4.2 m long, 2.8 m wide and 0.65 m deep. The sediment consisted of a mix of 70% quartz sand and 30% anthracite coal sand.

**Fig. 27.8** Flume experiment of a delta with low Froude number performed by John Martin, Ben Sheets, Chris Paola and Michael Kelberer. Image *source* https://www.esci.umn.edu/orgs/seds/Sedi_Research.htm

These experiments are used for a variety of reasons. One of them is to study the relationship between the surface processes and the subsurface deposition. An intriguing aspect of these experiments is that much of the natural variability is not due to forcing (e.g. uplift, changing sediment source), but due to the internal dynamics of the system itself, i.e. it is autogenic. In fact, it is not known if the autogenic behavior of natural channels is chaotic (Lanzoni and Seminara 2006), meaning one cannot predict with certainty the detailed configuration of even a single meandering channel very far into the future. This then has an immediate impact on uncertainty in the subsurface, in the sense that the configuration of deposits in the subsurface cannot be predicted with certainty away from wells. The experiment therefore investigates uncertainty related to the dynamics of the system, our lack of physical understanding (and not some parameter uncertainty or observational error). All this is a bit unnerving, since this very fundamental uncertainty is *never* included in any subsurface UQ. At best, one employs a Gaussian prior, or some geometric prior extracted from observation databases, as discussed above. The fundamental questions are:


To address these questions and provide some insight (not an answer quite yet!), we run the experiment under constant forcing for long enough to provide many different realizations of the autogenic variability, a situation that would be practically impossible to find in the field. The autogenic variability in these systems is due to temporal and spatial variability in the feedback between flow and sediment transport, weaving the internal fabric of the final subsurface system.

Under fixed boundary conditions, the observed variability in deposition is therefore the result of only the autogenic (intrinsic) variability in the transport system. The data-set we use here is based on a set of 136 time-lapse overhead photographs that capture the dynamics of flow over the delta approximately every minute. Figure 27.9 shows representative images from this database. This set of images represents a little more than 2 h of experimental run time. Figure 27.9b shows the binary (wet-dry) images for the same set, which will be used in the investigation.

The availability of a large reference set of images of the sedimentary system enables testing any statistical prior by allowing a comparison of the variability of the resulting realizations, since all possible configurations of the system are known. In addition, the physics are naturally contained in the experiment (photographs are the result of the physical depositional processes). A final benefit is that a physical analysis of the prior model can be performed, which aids in understanding what depositional patterns should be in the prior for more sophisticated cases.

**Fig. 27.9** Examples of a few photographic images of the flume experiment for different time periods. Flow is from top to bottom. **a** Photographs of the experiments. The blue pixels indicate locations where flow moves over the surface. The black sediment is coal which is the mobile fraction of the sediment mixture, and the tan sediment is sand. **b** Binary representation of the photographs. Black represents wet (flow) pixels, white represents dry (no flow) pixels

#### **Reproducing Physical Variability with Statistical Models**

In this study we employ a geostatistical method termed multiple-point geostatistics (MPS). MPS methods have grown popular in the last decade due to their ability to introduce geological realism in modeling via the training image (Mariethoz and Caers 2014). As with any geostatistical procedure, MPS allows for the construction of a set of stochastic realizations of the subsurface. Training images, along with trends (usually modeled using probability maps or auxiliary variables), constitute the prior model as defined in the traditional Bayesian framework. The choice of the initial set of training images has a large influence on the stated uncertainty, and hence a careful selection must be made to avoid artificially reducing uncertainty from the start.

It is unlikely that all possible naturally-occurring patterns can be contained in one single training image within the MPS framework (although this is still the norm; similarly, it is the norm to choose a multi-Gaussian model by default). To represent realistic uncertainty, realizations should be generated from multiple training images (TIs). The set of all these realizations then constitutes a wide prior uncertainty model. The choice of the TIs brings a new set of questions: how many training images should one use, and which ones should be selected? Ideally, the TIs should be generated in such a way that the natural variability of the system under study (fluvial, deltaic, turbidite, etc.) is represented, so that all natural patterns are covered in the possibly infinite set of geostatistical realizations. Scheidt et al. (2016) use methods of computer vision to select a set of representative TIs. One such computer vision method evaluates a rate of change between images in time, and the training images are selected in periods of relative temporal pattern stability (see Fig. 27.10).

The training image set shown in Fig. 27.10 displays patterns consistent with previous physical interpretations of the fundamental modes of this type of delta system: a highly channelized, incisional mode; a poorly channelized, depositional mode; and an intermediate mode. This suggests that some clues to the selection of appropriate training images lie in the physical properties of the images from the experiment.

**Fig. 27.10** Selected images by clustering based on the modified Hausdorff distance. The value at the top of the image represents the time in minutes of the experiment

With a set of training images available, multiple geostatistical realizations can be generated for each training image (basically a hierarchical model of realizations). These realizations can now be compared with the natural variability generated in the laboratory experiments, to verify whether such a set of realizations can reproduce natural variability. Scheidt et al. (2016) calculate the Modified Hausdorff Distance (MHD, a distance used in image analysis) between any two geostatistical realizations and also between any two overhead shots. A QQ-plot of the distribution of the MHD between all the binary snapshots of the experiment and the MPS models is shown in Fig. 27.11a, showing similarity in distribution.
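To make the comparison concrete, the following is a minimal sketch of how an MHD between two binary (wet-dry) images and the quantile pairs of a QQ-plot could be computed. It assumes the usual definition of the MHD as the larger of the two directed mean nearest-neighbour distances between the wet-pixel sets; the function names and the random toy images are hypothetical and are not taken from Scheidt et al. (2016).

```python
import numpy as np

def directed_mean_distance(a_pts, b_pts):
    """Mean, over the points of a_pts, of the distance to the nearest point of b_pts."""
    diff = a_pts[:, None, :] - b_pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    return dist.min(axis=1).mean()

def modified_hausdorff(img_a, img_b):
    """MHD between the wet (True) pixel sets of two binary images."""
    a_pts = np.argwhere(img_a).astype(float)
    b_pts = np.argwhere(img_b).astype(float)
    if len(a_pts) == 0 or len(b_pts) == 0:
        raise ValueError("both images need at least one wet pixel")
    return max(directed_mean_distance(a_pts, b_pts),
               directed_mean_distance(b_pts, a_pts))

def pairwise_mhd(images):
    """MHD for every unordered pair of images in a list."""
    n = len(images)
    return np.array([modified_hausdorff(images[i], images[j])
                     for i in range(n) for j in range(i + 1, n)])

def qq_points(sample_x, sample_y, n_quantiles=10):
    """Matching quantiles of two samples: the coordinates of a QQ-plot."""
    probs = np.linspace(0.05, 0.95, n_quantiles)
    return np.quantile(sample_x, probs), np.quantile(sample_y, probs)

# Toy stand-ins (assumption): random binary images instead of the
# experimental snapshots and the MPS realizations.
rng = np.random.default_rng(0)
experiment = [rng.random((20, 20)) < 0.3 for _ in range(6)]
realizations = [rng.random((20, 20)) < 0.3 for _ in range(6)]
q_exp, q_sim = qq_points(pairwise_mhd(experiment), pairwise_mhd(realizations))
```

Plotting `q_sim` against `q_exp` gives the QQ-plot; points close to the diagonal indicate that the two sets of images have a similar distribution of pairwise distances.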

The result is encouraging but also emphasizes a mostly ignored question, namely what a complete geological prior entails, and shows that the default choices (one training image, one Boolean model, one multi-Gaussian distribution) make very little sense when dealing with realistic subsurface heterogeneity. The broader question remains as to how such a prior should be constructed from physical principles and how statistical models, such as geostatistics, should be employed in Bayesianism when applied to geological systems. This fundamental question remains unresolved and certainly under-researched.

**Fig. 27.11 a** QQ-plot of the MHD distances between the 136 images from the experiment and 136 images generated using DS. **b** Comparison of the variability, as defined by MHD, between generated realizations per each training image (red) and the images from the experiment (blue) closest (in MHD) to the selected TI

### **Field Application**

The above flume experiments have helped in understanding the nature of a geological prior, at least for deltaic-type deposits. Knowledge accumulated from these experiments will improve scientific understanding of the fundamental processes involved in the genesis of these deposits, and thereby of the range of variability of the generated stratigraphic sequences.

It is unlikely, however, that laboratory experiments will be of direct use in actual applications, since they take considerable time and effort to set up. In addition, there is a question of how these experiments scale to the real world. It is more likely in the near future that computer models, built from such understanding, will be used in actual practice. Various such computer models exist for depositional systems (process-based, process-mimicking, etc.).

We consider here one such computer model, FLUMY (Cojan et al. 2005), which is used to model meandering channels, see Fig. 27.12. FLUMY uses a combination of physical and stochastic process models to create realistic geometries. It is not an object-based model, which would focus on the end result, but it actually creates the depositional system. The input parameters are therefore a combination of physical parameters as well as geometrical parameters describing the evolution of the deposition.

Consider a simple application to an actual reservoir system (Courtesy of ENI). Based on geological understanding generated from well data and seismic, modelers are asked to input the following FLUMY parameters: channel width, depth and sinuosity (geometric), and two aggradation parameters: (1) decrease of the alluvium thickness away from the channel, and, (2) maximum thickness deposited on levees during an overbank flood. More parameters exist but these are kept fixed for this simple application.

**Fig. 27.12** Example of a FLUMY model with several realizations of the prior generated from FLUMY with uncertain input parameters

The prior belief now consists of (1) assuming the FLUMY model as a hypothesis that describes variability in the depositional system and (2) prior distributions of the five parameters. After generating thousands of FLUMY models (see Fig. 27.12), we can run the same analysis as done for the flume experiment to extract modes in the system that can be used as training images for further geostatistical modeling.
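As an illustration of step (2), the sketch below draws input-parameter sets from independent uniform priors. The parameter names, the ranges and the choice of uniform distributions are all hypothetical placeholders (they are not values from the ENI case), and no FLUMY interface is called: each sampled row would simply be passed to the simulator by whatever means is available.

```python
import numpy as np

# Hypothetical prior ranges for the five uncertain inputs (illustrative only).
PRIOR_RANGES = {
    "channel_width_m": (50.0, 300.0),
    "channel_depth_m": (2.0, 10.0),
    "sinuosity": (1.1, 2.5),
    "thickness_decay_length_m": (200.0, 2000.0),  # decrease of alluvium thickness away from channel
    "max_levee_thickness_m": (0.1, 1.0),          # maximum overbank deposit on levees
}

def sample_prior(n_models, ranges=PRIOR_RANGES, seed=0):
    """Draw n_models independent parameter sets from uniform priors."""
    rng = np.random.default_rng(seed)
    return {name: rng.uniform(low, high, size=n_models)
            for name, (low, high) in ranges.items()}

# One row per prior model; each row would parameterize one simulator run.
samples = sample_prior(1000)
first_model = {name: values[0] for name, values in samples.items()}
```

Independent uniform priors are the simplest possible choice; correlated or expert-elicited distributions can be substituted without changing the structure of the sketch.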

### **27.10 Summary**

Eventually philosophical principles will need to be translated into workable practices, ultimately into data acquisition, computer codes, and actual decisions. Some important observations, and perhaps also personal opinions, based on this chapter are:

• *Data acquisition, modeling and predictions "collaborate";* going from data to


of a suitable set of representative scenarios to represent the geological process taking place. This was illustrated in the flume experiment study.

• *Falsification of the posterior*. The posterior is the result of the prior model choice, the likelihood model choice and all of the auxiliary assumptions and choices made (dimension reduction method, sampler choices, convergence assessment, etc.). Acceptance of the posterior "as is" would follow the pure inductionist approach. Just as for the prior, it would be good practice to attempt to falsify the posterior. This can be done in several ways, usually using hypothetico-deductive analysis, such as the significance tests introduced in this chapter.

### **References**


Fine A (1973) Probability and the interpretation of quantum mechanics. Br J Philos Sci 24(1):1–37


Lantuéjoul C (2013) Geostatistical simulation: models and algorithms. Springer Science & Business Media

Lanzoni S, Seminara G (2006) On the nature of meander instability. J Geophys Res Earth Surf 111(4)

Lindgren B (1976) Statistical theory. MacMillan, New York


Zadeh L (1965) Fuzzy sets. Inf Control 8:338–353

Zadeh LA (1975) Fuzzy logic and approximate reasoning. D Reidel Publ Co 30(i):407–428

Zadeh LA (1978) Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets Syst 1(1):3–28

Zadeh LA (2004) Fuzzy logic systems: origin, concepts, and trends. Science, pp 16–18

Zinn B, Harvey CF (2003) When good statistical models of aquifer heterogeneity go bad: a comparison of flow, dispersion, and mass transfer in connected and multivariate Gaussian hydraulic conductivity fields. Water Resour Res 39(3):1–19


### **Chapter 28 Geological Objects and Physical Parameter Fields in the Subsurface: A Review**

**Guillaume Caumon**

**Abstract** Geologists and geophysicists often approach the study of the Earth using different and complementary perspectives. To simplify, geologists like to define and study objects and make hypotheses about their origin, whereas geophysicists often see the earth as a large, mostly unknown multivariate parameter field controlling complex physical processes. This chapter discusses some strategies to combine both approaches. In particular, I review some practical and theoretical frameworks associating petrophysical heterogeneities to the geometry and the history of geological objects. These frameworks open interesting perspectives to define prior parameter space in geophysical inverse problems, which can be consequential in under-constrained cases.

### **28.1 Introduction**

The earth is three-dimensional, heterogeneous and, for its major part, inaccessible to direct observations. A consequence is that the static and dynamic parameters governing physical processes below the earth surface are generally poorly known. A recurrent challenge for geoscientists and engineers is, therefore, to predict the likely nature or behavior of the subsurface from limited data. In all fields of geophysics sensu lato, these forecasts may use physically and mathematically-based data processing (such as upward continuation of potential fields, seismic processing, classical processing of ground penetrating radar (Nobakht et al. 2013), reservoir production decline curves (Davis and Annan 1989; Fetkovich 1980); Fig. 28.1a), or the resolution of an inverse problem that explicitly uses physical models computing observations from some earth parameters and physical parameters (Fig. 28.1b–d, f–h). In geology, forecasts (e.g., about the location and volume of a specific formation or resource) and geological scenarios involve direct observations and geophysical images (Jackson and Rotevatn 2013; Perrouty et al. 2014; Fig. 28.1e). In this process, the loop may not always close: in the end, the interpretations are not guaranteed to be compatible with the initial geophysical observations. This may or may not be a problem, depending on the purpose of this interpretation. For example, a qualitative match between reflection seismic data and structural interpretations is probably sufficient to discuss fault growth models (Jackson and Rotevatn 2013), whereas such mismatch can be problematic in other tasks such as natural resource assessment (Caumon 2010; Jessell et al. 2014). Another practical problem is the interpretation and fusion of several independent data sets corresponding to different physical or geological observations (Corbel and Wellmann 2015; Paasche 2016). Geostatistics (Chiles and Delfiner 2012; Goovaerts 1997) was historically developed with these problems in mind, and is an attractive theoretical framework to recombine point and volume data coming from geophysical images consistently with spatial statistics. However, geological reasoning and statistical reasoning are of a different nature (Frodeman 1995), so honoring some spatial statistics is very useful but not always sufficient to represent geological concepts. Therefore, several methodologies have been introduced to explicitly incorporate geological knowledge in subsurface interpretation, all of them explicitly considering geological objects (Fig. 28.1f–h).

G. Caumon (✉)

GeoRessources-ENSG, Université de Lorraine – CNRS–CREGU, 2 rue du Doyen Marcel Roubault, 54500 Vandoeuvre-lès-Nancy, France

e-mail: Guillaume.caumon@univ-lorraine.fr

© The Author(s) 2018

B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_28

**Fig. 28.1** Examples of approaches and workflows using geological and geophysical data to make forecasts about the earth or the associated processes and some illustrative references. **a**-**d** workflows that use no or minimal geological prior information. **e** classical use of geophysical data in fundamental and applied geology. **f**-**h** workflows that explicitly incorporate geological parameters in the process

This chapter reviews the main frameworks by which geological concepts can be represented in earth models and inverse methods addressing several types of physics. Thus, it aims at complementing the existing reviews and discussions of Linde et al. (2015) and Jessell et al. (2014), who address this problem with similar objectives but different perspectives. As the topic is vast, the reader is also referred to previous review papers related to this topic (Farmer 2005; Lelièvre and Farquharson 2016; Linde et al. 2015; de Marsily et al. 2005; Mosegaard and Hansen 2016; Oliver and Chen 2011; Pyrcz et al. 2015; Zhou et al. 2014a). Several books also present complementary perspectives and more complete descriptions and details (Agterberg 2014; Caers 2011; Mallet 2002, 2014; Perrin and Rainaud 2013; Pyrcz and Deutsch 2014). Section 28.2 provides further motivations for considering geology in geophysical models, and tries to define what "geology" means in that sense. Then, Sect. 28.3 briefly describes the type of parameterizations classically used in computational physics. We discuss some links between these physical parameterizations and the frameworks used to represent geological domains in Sect. 28.4.

### **28.2 Motivations for Explicit Geological Parameterizations**

A wealth of complementary perspectives is essential to make progress in the understanding of our planet and its resources. This is exemplified by the various disciplines involved in natural resource characterization, see for instance Ringrose and Bentley (2015). Feedbacks and interactions between the various approaches generate many types of possible workflows to integrate geological data and produce forecasts, as illustrated in Fig. 28.1. For example, the results of geophysical processing and inverse methods that use minimal geological prior information (Fig. 28.1a–d) are typically treated as data for geological interpretations (Fig. 28.1e). Whereas these "minimal prior" approaches are not this chapter's focus, they are very useful and are always used to some extent in practical studies, because they provide at least a useful first-order view of the geological domain. This is illustrated in particular in the deterministic workflows of Fig. 28.1h that strive for fit-for-purpose subsurface models that are as simple as possible (Elrafie et al. 2008; Ringrose and Bentley 2015; Williams et al. 2004). They are also conceptually satisfying in the sense that they produce images or forecasts that mainly depend on the physics, hence can be claimed to be parsimonious and objective. As a consequence of this parsimony and of the non-linear nature of most of the physical processes involved, these models make it difficult to evaluate uncertainty (Watson et al. 2013). The term "objective" is also relative, as some choices are always made in these methods. In data processing methods these subjective choices relate to the underlying model assumptions (e.g., sub-horizontal layers). In inverse methods, choices must also be made about the parameterization and about a statistical model (e.g., the multi-Gaussian model) or a particular regularization (e.g., Mosegaard 2011).

Among the approaches that try to get the most out of the physics with minimal assumptions, recent and most promising developments use several types of data and petrophysical models to constrain local anisotropy; see for instance Clapp et al. (2004), Ma et al. (2012), Sava et al. (2014), Zhou et al. (2014b) and the recent reviews in geophysical imaging (Meju and Gallardo 2016), reservoir seismology (Bosch et al. 2010), hydrogeophysics (Linde and Doetsch 2016), mineral exploration (Lelièvre and Farquharson 2016) and petroleum exploration (Moorkamp et al. 2016). Two main ideas underlie these approaches. First, some local structural orientations are inferred from borehole data or other geophysical data to constrain the covariance function used during inversion. Second, a petrophysical model is used to exploit the existing correlation between the physical parameters. As these correlations generally depend on the rock type, the model often includes discrete variables that estimate or sample the rock type at a given location. This notion of rock type is close to the notion of lithofacies, so it is a way to integrate geological reasoning into inverse methods.

In the fields of reservoir engineering and hydrogeology, methods incorporating prior geological knowledge in flow and transport models have also been developed very early on, as discussed in several review papers (Farmer 2005; Linde et al. 2015; de Marsily et al. 2005; Oliver and Chen 2011; Zhou et al. 2014a). One fundamental reason is that flow and transport processes can be highly non-linear while pressure and concentration measurements are generally quite sparse as compared to the number of potential factors influencing fluid transfers in porous and fractured media. The same observation holds in potential field inversion, where geological prior information can significantly help address the ill-posedness of the inverse problem (Lelièvre and Farquharson 2016). But what exactly does "geological prior" mean?

As noted in particular by Frodeman (1995), geology is an interpretive science which includes a significant component of historical thinking. One aim of geology is to describe the earth in historical terms by identifying the main geological processes and their impact. In terms of scientific philosophy, it is interesting to highlight that geology generally produces refutable scenarios, whereas mathematics is concerned with formal and irrefutable proofs (given some hypotheses). The encounter of these two scientific methods is deeply written in the DNA of Mathematical Geosciences. Advanced methods in physically-based modeling have been developed to quantitatively model geological processes. Some very interesting inverse methods that use such models have been developed recently to quantitatively integrate spatial observations (Charvin et al. 2009; Cross and Lessenger 1999; Gallagher et al. 2009). These methods are ideal in the sense that they could in principle unify geology and geophysics rigorously. However, the interplay of multiple coupled physical and chemical processes at geological time scales remains extremely challenging to model on a computer. The use of such models in an inverse framework is also very challenging, as the number of unknown or poorly known parameters makes the inverse problem highly ill-posed and computationally intractable. This empty space problem is very general and applies to most inverse problems in geosciences, but it is critical when an explicit time dimension is considered because the density of information in time-space is very small (e.g., only a few points typically constrain pressure and temperature in basin studies). This explains why most of the methods in Fig. 28.1e–h do not explicitly consider geological time and instead use an object-based approach, a statistics-based approach or a combination of both to represent the geological prior information and make forecasts in the 3D physical space.

Classically, the object-based strategy is essential to the geological approach. For example, geological mapping typically decomposes a complex reality into discrete and interconnected tectonic, igneous, metamorphic, diagenetic, stratigraphic and sedimentological objects. These object definitions do integrate historical and process-based considerations. For instance, time is explicitly considered in the definition of the remarkable surfaces that sequence stratigraphers use to interpret geoscience data. The characterization of these objects in mathematical and computational terms has been a significant focus of the IAMG for the last 50 years. The statistics-based approach, another clear focus of the IAMG, is clearly complementary to the object-based approach. Indeed, objects are heterogeneous, boundaries between objects may be difficult to define and objects can be difficult to map from available observations. Statistical reasoning is key to addressing these problems. In this chapter, we will try to explain a few ways in which the object-based and statistics-based methods interact in the context of geo-data and physical modeling integration. For this, we will start from the perspective of what physical modeling needs.

### **28.3 Parameterizations for Physical Models**

Sambridge et al. (2012), among others, give a very crisp and generic summary of the parameterizations used in most numerical physical modelling methods. In this view, a model $m(\mathbf{x})$ is defined at any point $\mathbf{x}$ of the physical space by a set of basis functions $\varphi_k$:

$$m(\mathbf{x}) = \sum_{k=1}^{K} m_k \varphi_k(\mathbf{x}). \tag{28.1}$$

For example, in the finite element method with linear triangular elements, a basis function $\varphi_k$ is defined for each mesh vertex $\mathbf{x}_k$: $\varphi_k(\mathbf{x}_k)$ is equal to 1, $\varphi_k(\mathbf{x}_{j \neq k})$ is equal to 0 and $\varphi_k$ decreases linearly in the mesh elements adjacent to $\mathbf{x}_k$. The values $m_k$ are the parameter values (e.g., thermal conductivity) associated with the mesh vertices.
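The finite element example can be sketched in one dimension, where the linear basis functions reduce to "hat" functions on a line mesh. This is an illustrative simplification of the triangular case described above; the mesh coordinates and nodal values are hypothetical.

```python
import numpy as np

def hat_basis(x, nodes, k):
    """Piecewise-linear basis function of node k on a sorted 1D mesh.

    Equals 1 at nodes[k], 0 at every other node, and varies linearly
    in the (at most two) elements adjacent to nodes[k].
    """
    x = np.atleast_1d(np.asarray(x, dtype=float))
    phi = np.zeros_like(x)
    xk = nodes[k]
    if k > 0:  # element to the left of node k
        xl = nodes[k - 1]
        mask = (x >= xl) & (x <= xk)
        phi[mask] = (x[mask] - xl) / (xk - xl)
    if k < len(nodes) - 1:  # element to the right of node k
        xr = nodes[k + 1]
        mask = (x >= xk) & (x <= xr)
        phi[mask] = (xr - x[mask]) / (xr - xk)
    return phi

def evaluate_model(x, nodes, m):
    """Evaluate m(x) = sum_k m_k * phi_k(x), i.e. Eq. (28.1)."""
    return sum(m[k] * hat_basis(x, nodes, k) for k in range(len(nodes)))

# Hypothetical mesh and nodal values (e.g. thermal conductivity).
nodes = np.array([0.0, 1.0, 2.5, 4.0])
m = np.array([2.0, 3.0, 1.0, 4.0])
x = np.linspace(0.0, 4.0, 9)
m_of_x = evaluate_model(x, nodes, m)
```

With this choice of basis, Eq. (28.1) is simply linear interpolation of the nodal values, which is why the result agrees with `np.interp(x, nodes, m)` inside the mesh.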

The general formulation (28.1) makes it possible to compute or approximate differential operators in order to solve the partial differential equations describing physical processes. Many recent advances in computational physics consist of particular choices of basis functions. For instance, in the extended finite element method, the use of Heaviside basis functions to represent internal discontinuities in a mesh was a step change in the computation of fracture growth (Moës et al. 2002). Another very active research field concerns the combination of basis functions at several scales (e.g., Efendiev et al. 2013). These methods have been applied for instance in finite volume modeling of flow in porous media to solve the flow and transport equations at two distinct and interacting scales (Jenny et al. 2003; Møyner and Lie 2014).

Equation (28.1) is also compatible with the theory of spatial random fields. At point scale, the values $m_k$ are seldom known below the Earth surface. Geostatistics offers many ways to estimate or simulate such values (Chiles and Delfiner 2012; Goovaerts 1997) using statistical parameters inferred from subsurface data. One of these parameters is the variogram, which models the statistical correlation between the values of a variable at two locations as a function of the distance between them. In dual kriging, Eq. (28.1) is also used, as the unknown value is estimated as a linear combination of covariance functions centered on the data points. The use of point-based parameterizations is also much studied in computational physics under the term "meshless methods", see for instance Liu and Gu (2005). In the practice of geostatistical methods, the values $m_k$ are generally modeled on a Cartesian grid, but recent papers also discuss the application of geostatistics on unstructured grids (Gross and Boucher 2015; Manchuk et al. 2005; Zaytsev et al. 2016), or directly on points (Zagayevskiy and Deutsch 2016). A major interest of these methods is to estimate or simulate values directly on the physical modeling support, and also to use adaptive resolution depending on the local information density and on the sensitivity between the model parameters and the physical process.
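The remark on dual kriging can be illustrated with a minimal one-dimensional sketch in which the covariance functions centered on the data points play the role of the basis functions of Eq. (28.1). For brevity it assumes simple kriging with a known mean and an exponential covariance model; the data values, sill and range are hypothetical.

```python
import numpy as np

def exp_cov(h, sill=1.0, corr_range=10.0):
    """Exponential covariance model with practical range corr_range (assumed)."""
    return sill * np.exp(-3.0 * np.abs(h) / corr_range)

def dual_kriging_coeffs(x_data, z_data, mean, cov=exp_cov):
    """Solve C b = z - mean once; b holds the coefficients m_k of Eq. (28.1)."""
    c_matrix = cov(x_data[:, None] - x_data[None, :])
    return np.linalg.solve(c_matrix, z_data - mean)

def dual_kriging_estimate(x, x_data, coeffs, mean, cov=exp_cov):
    """z*(x) = mean + sum_k b_k C(x - x_k): basis functions are covariances."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return mean + cov(x[:, None] - x_data[None, :]) @ coeffs

# Hypothetical data along a 1D transect.
x_data = np.array([1.0, 4.0, 9.0, 15.0])
z_data = np.array([2.1, 2.9, 1.7, 2.4])
mean = 2.0

coeffs = dual_kriging_coeffs(x_data, z_data, mean)
z_hat = dual_kriging_estimate(np.linspace(0.0, 20.0, 41), x_data, coeffs, mean)
```

The linear system is solved once, after which the estimate at any location is a plain weighted sum; the estimator honors the data exactly and reverts to the mean far from all data points.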

Last, but not least, Eq. (28.1) is compatible with a new breed of inverse methods in which the number of parameters *K* is variable, see Sambridge et al. (2012) and references therein. These transdimensional inverse methods show much promise to address some of the challenges highlighted in this chapter.

The beauty of Eq. (28.1) lies in its potential to unify object-based geological descriptions and mathematical descriptions. In a sense, the goal of the various workflows described in Fig. 28.1 and in the associated references can be seen as a quest to find "geological basis functions" to model the earth. The purpose of Sect. 28.4 is to try to establish a more explicit correspondence between geological concepts and existing mathematical and computational models for representing geological domains in three-dimensional space. In doing so, we keep in mind that these 3D models will eventually need to be expressed by Eq. (28.1) in physical models.

### **28.4 Geological Parameterizations**

As discussed in Sect. 28.2, geologists apply the divide-and-conquer principle to analyze the earth. Hundreds of years of geological reasoning have essentially led to the identification of multiple geological features at various scales, depending on their origin:


These features typically exist at kilometric to micrometric scales (from plates to minerals and fluid inclusions). It is not useful (and not possible) for a model to explicitly represent all objects across these scales. Rather, most modeling approaches hierarchically subdivide the domain to represent a few nested scales (Pyrcz and Deutsch 2014; Ringrose and Bentley 2015).

Two main complementary mathematical and numerical frameworks exist to represent these geological features: spatial random fields and object-based methods. The choice of which framework is most appropriate (or whether and how these frameworks should be combined) depends on the size of the features with regard to the density of observations and on the likely impact of the features for the question at hand. Whereas the object size can be objectively discussed and characterized, the impact of features is often assessed using rules of thumb derived from experience (Ringrose and Bentley 2015). This may be a source of bias in forecasts. In practical studies, choices may also be constrained by practical considerations, as some methods are implemented only in commercial software or in separate software packages that are not interoperable. These problems and the need for better and abstract knowledge integration are also discussed by Perrin and Rainaud (2013).

### *28.4.1 Spatial Random Fields*

As geological processes are not random and result from many physical processes, the resulting spatial fields are generally correlated in space. The characterization of the correlation structure by statistical inference is an essential aspect of geostatistics (Chiles and Delfiner 2012; Goovaerts 1997). Indeed, trust can be gained when data are numerous enough to provide robust statistics—even though the modeling assumptions themselves may remain questionable (Journel 2005). In inverse modeling of flow and transport in porous media, this has led to many approaches that perturb parameters on a grid while preserving variogram or spatial covariance models (de Marsily et al. 2005; Oliver and Chen 2011; Zhou et al. 2014a).

In geostatistics, a result of the divide-and-conquer strategy used in geology is the definition of many types of discrete categories to describe the physical world. These categories can be localized in space in the form of a geological map (or, in three dimensions, a 3D geological model). From a geostatistical standpoint, categories can be modeled with indicator variables. This has led to significant advances, in particular in the field of multiple-point geostatistics (MPS), to represent discrete facies from sparse data and analog training images. Since the seminal work of Guardiano and Srivastava (1993), a vast community of mathematical geoscientists has embraced this field and made essential advances, see Hu and Chugunova (2008), Mariethoz and Caers (2014). In particular, MPS methods have opened concrete and effective ways of using complex (and deliberately subjective) geological prior models in inversion (Linde et al. 2015; Melnikova et al. 2015). MPS methods have shown, in a number of instances, the impact of applying analog reasoning and scenarios to find sensible sets of solutions to inverse problems and to assess uncertainties. They also provide an interesting formalism to analyze complex geological systems (Scheidt et al. 2016).

However, even though progress can still be made (see for instance Renard et al. 2011), a recurrent challenge with the indicator geostatistical approaches is to ensure that some categories are always connected or adjacent to other categories. This is why, to echo a friendly discussion we had with Andre Journel in 2005, I persist in considering that there is more to geological realism than MPS (in its spatial understanding). The Truncated Gaussian method and the Pluri-Gaussian methods (Armstrong et al. 2011), even though they rely on multi-Gaussian assumptions, enforce continuity conditions that approach geological reasoning in a very interesting way. For instance, they can produce consecutive successions of facies from shallow marine to offshore environments. This type of method is appropriate when the discrete geological categories originate from an underlying continuous variable (in the previous example, this variable can be likened to bathymetry, all facies being defined between consecutive threshold values). In the Pluri-Gaussian approach, the application of Boolean operations on simulated random fields is also a way to emulate the succession of geological events (e.g. simulation of late diagenetic facies overprinting the depositional facies).
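The truncation idea can be sketched in a few lines: a spatially correlated Gaussian field is simulated and then cut at increasing thresholds, so that each facies occupies the interval between two consecutive threshold values. The sketch is unconditional and one-dimensional, uses a simple moving-average simulation, and its thresholds and correlation length are hypothetical; it is not the algorithm of Armstrong et al. (2011).

```python
import numpy as np

def gaussian_field_1d(n, corr_len, rng):
    """Correlated standard Gaussian field by moving average of white noise.

    Assumption: the field is rescaled with its sample mean and standard
    deviation, which is a shortcut rather than a rigorous simulation.
    """
    half = 3 * corr_len
    lags = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (lags / corr_len) ** 2)
    noise = rng.standard_normal(n + 2 * half)
    field = np.convolve(noise, kernel, mode="valid")  # length n
    return (field - field.mean()) / field.std()

def truncate(field, thresholds):
    """Facies code i where thresholds[i-1] <= field < thresholds[i]."""
    return np.digitize(field, thresholds)

rng = np.random.default_rng(42)
n_cells, corr_len = 500, 15
thresholds = [-0.5, 0.5]  # hypothetical cut-offs on the Gaussian variable

field = gaussian_field_1d(n_cells, corr_len, rng)
# 0 = shallowest facies, 1 = intermediate, 2 = deepest (illustrative labels)
facies = truncate(field, thresholds)
```

Because the underlying field is smooth, going from facies 0 to facies 2 generally requires crossing facies 1, which is how the method tends to reproduce ordered facies successions.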

In general, spatial random field methods are implemented on grids of fixed resolution. As a result, the discontinuities that may exist in the medium are sampled at that particular resolution. However, some important features such as fractures or shale lenses may be much smaller than the grid resolution, and hence cannot be explicitly represented in the grid. Under some hypotheses, this can be addressed by directly modeling a field of equivalent properties assumed representative of the block scale (e.g., equivalent dual porosity and dual permeability fields in fractured media). However, this can be a source of bias in a number of cases (Jackson et al. 2014). The explicit consideration of these objects generally relies on fewer assumptions and provides a way to deal with more complex geometries and with spatial observations, as will be discussed in Sect. 28.4.2. Note that these two approaches are not mutually exclusive, and a combination of equivalent and explicit approaches is, in general, relevant (Bourbiaux et al. 2002; Maier et al. 2016).

Another important aspect of geological reality is that the orientation and the magnitude of spatial correlation can vary in space. This can be modelled with random fields using locally varying anisotropy (Boisvert et al. 2009; Stroet and Snepvangers 2005; Xu 1996). In geophysics, the use of local anisotropy is illustrated for instance by Clapp et al. (2004) and by the image-guided inversion methods mentioned in Sect. 28.2 and Fig. 28.1d. In the absence of exhaustive data to constrain these orientations, one should estimate or simulate the orientations away from local observations (Gumiaux et al. 2003; Stroet and Snepvangers 2005; Xu 1996). A practical challenge in the presence of locally varying anisotropy is the inference of geostatistical parameters, as the domain is non-stationary. Object approaches offer another way of dealing with locally varying anisotropy, as will be discussed in the next section.

### *28.4.2 Object Models*

In a general sense, object models directly represent the tectonic, sedimentological, intrusive and epigenetic features listed at the beginning of Sect. 28.4. As geological objects originate from distinct geological processes at different periods of time, they often correspond to contrasts or discontinuities of the physical parameters of interest. This explains why, beyond pure cartographic goals, so much effort is dedicated to object modeling in geosciences.

### **Geometry and Topology**

As discussed by Mallet (2002) and Perrin and Rainaud (2013), geological objects can be represented in geometrical and topological terms. Topology refers to essential characteristics: the dimension of objects (line, surface or volume), whether objects have inclusions or holes, and whether they are connected to other objects. Depending on the type of geological objects, some topological configurations are impossible (Caumon et al. 2004). For instance, a chronostratigraphic horizon must be an open surface and may include internal holes due to faults or intrusions. More generally, the continuity of objects can be related to the genesis of the object, and hence is a way to constrain geological models. Knowing what is topologically possible and what is not gives precious insights to design modeling methods and to test the validity of geological models (Pellerin et al. 2017; Wellmann et al. 2014). Topological analysis also provides interesting metrics to characterize and understand geological objects such as karsts (Collon et al. 2017), fracture networks (Sanderson and Nixon 2015) and structural models (Lindsay et al. 2013; Pellerin et al. 2015; Thiele et al. 2016a, b). Last, but not least, topology is very important for flow modeling, as it directly relates to the connectivity of permeability conduits and barriers. The links between connectivity and effective flow properties have been much studied at multiple scales in the frame of percolation theory (Berkowitz and Balberg 1993; King et al. 2001). In the cases where geological considerations are not sufficient to fully characterize the topology of the medium, specific methods have been proposed to find possible object geometry honoring some prescribed connectivity (Borghi et al. 2012; Collon-Drouaillet et al. 2012; Henrion et al. 2010).
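As a small illustration of connectivity as a topological metric, the sketch below labels the connected clusters of permeable cells in a tiny hypothetical grid and checks whether one cluster spans the grid from left to right, which is the kind of question percolation theory asks. The grid values and the 4-neighbour rule are assumptions made for the example.

```python
from collections import deque

def connected_components(grid):
    """Label 4-connected clusters of truthy cells; returns (labels, count)."""
    n_rows, n_cols = len(grid), len(grid[0])
    labels = [[0] * n_cols for _ in range(n_rows)]
    count = 0
    for i in range(n_rows):
        for j in range(n_cols):
            if grid[i][j] and labels[i][j] == 0:
                count += 1
                labels[i][j] = count
                queue = deque([(i, j)])
                while queue:  # breadth-first flood fill of one cluster
                    r, c = queue.popleft()
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < n_rows and 0 <= cc < n_cols
                                and grid[rr][cc] and labels[rr][cc] == 0):
                            labels[rr][cc] = count
                            queue.append((rr, cc))
    return labels, count

def spans_left_to_right(grid):
    """True if a single cluster touches both the left and right edges."""
    labels, _ = connected_components(grid)
    left = {row[0] for row in labels} - {0}
    right = {row[-1] for row in labels} - {0}
    return bool(left & right)

# Hypothetical facies grids: 1 = permeable, 0 = barrier.
connected_case = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 1],
]
blocked_case = [
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
]
print(spans_left_to_right(connected_case), spans_left_to_right(blocked_case))
```

Changing a single cell turns one spanning cluster into two disconnected ones, which is why small geometric uncertainties can have a large effect on flow predictions.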

Geometry concerns the embedding of the topological objects in 3D space, and is typically described either analytically (e.g., ellipses for fractures) or numerically (using a mesh). Meshes provide much flexibility to discretize the geometry of rock volumes (geological bodies), surfaces (geological boundaries) and lines (contacts between boundaries). All these geometric components are linked by topological relationships (Pellerin et al. 2017). More fundamentally, meshes are a way to define basis functions approximating the geometry of the true object. For example, one can define a triangulated surface mathematically as a set of "hat" basis functions centered on each surface node (each taking the value 1 at its node and decreasing linearly to zero at the node's neighbors), as in Eq. (28.1). This description is very powerful to devise advanced geometry processing algorithms and reduce the dimensionality of complex geometrical shapes (Vallet and Lévy 2008). In the frame of inverse modeling, several inverse methods use the meshed model geometry as an unknown parameter (Fullagar et al. 2000; Gjøystdal et al. 1985; Mondal et al. 2010).
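The hat-function description can be sketched in one dimension, where a polyline plays the role of the triangulated surface. The example assumes that the representation is a weighted sum of basis functions, with node values as weights; the node positions and elevations are hypothetical.

```python
import numpy as np

def hat(k, nodes, x):
    """Piecewise-linear hat function of node k: 1 at node k, 0 at other nodes."""
    unit = np.zeros(len(nodes))
    unit[k] = 1.0
    return np.interp(x, nodes, unit)

nodes = np.array([0.0, 1.0, 2.5, 4.0])        # node abscissae (hypothetical)
values = np.array([10.0, 12.0, 11.0, 15.0])   # node elevations (hypothetical)

def surface(x):
    """Elevation as a weighted sum of hat functions."""
    return sum(values[k] * hat(k, nodes, x) for k in range(len(nodes)))

x = np.linspace(0.0, 4.0, 9)
print(surface(x))
```

Within the mesh the hat functions sum to one, so the sum reproduces the node values exactly and interpolates linearly between them. Perturbing one entry of `values` deforms the geometry only between the neighbouring nodes.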

Over the past decade, computational advances have also made it possible to consider implicit surfaces to represent geological boundaries. In these approaches, the surfaces are considered as level sets of some three-dimensional scalar field (Calcagno et al. 2008; Cowan et al. 2003; Frank et al. 2007; Henrion et al. 2010). These methods share the same principles as the Truncated Gaussian and Pluri-Gaussian methods (Mannseth 2014), but the underlying random function model is not necessarily Gaussian, and their focus is set on the geometry of object boundaries. These level set methods are very powerful to automate geometric modeling tasks such as interpolation and extrapolation. In particular, they have attracted much interest in stratigraphic modeling, as one single scalar field can represent a conformable stratigraphic series at once, which opens new possibilities in structural data interpolation (Calcagno et al. 2008; Caumon et al. 2013; Hillier et al. 2014; Laurent et al. 2016). Implicit surfaces also offer convenient ways to consider the geometric model perturbations needed to address inverse problems in geosciences (Cardiff and Kitanidis 2009; Caumon et al. 2007; Noetinger 2013; Zheglova et al. 2013). A major distinction between explicit and implicit surface models concerns topological control: the surface topology has to be chosen before interpolation in explicit methods, whereas it emerges from the interpolation in implicit models; see also Collon et al. (2016) for more discussion.

As in Pluri-Gaussian simulation, it is possible to indirectly account for geological time in object models using the truncation between implicit or explicit objects (Calcagno et al. 2008; Caumon et al. 2009; Gjøystdal et al. 1985). Boolean operations also provide ways to obtain sharp features in object geometry using constructive solid geometry principles (Rongier et al. 2014; Ruiu et al. 2016). In terms of Eq. (28.1), Boolean operations between implicit objects can be described as indicator (or Heaviside) basis functions (Mannseth 2014; Moës et al. 2002): these functions are equal to zero on one side of the interface and equal to 1 on the other side. The representation of faults is a major challenge which is specific to geosciences. Indeed, faults are not just discontinuities or sharp geometric features: they result from the sliding of rocks that were previously connected. Several authors have proposed mathematical or numerical solutions to address this problem by considering directly or indirectly the displacement between either side of a fault (Calcagno et al. 2008; Georgsen et al. 2012; Hale 2013; Holden et al. 2003; Jessell and Valenta 1996; Laurent et al. 2013; Mallet 2002, 2014).
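A minimal sketch of this idea follows, on a vertical 2D section. The two implicit functions (an undulating horizon and an elliptical channel body), the property values and the rule that the younger channel overprints the older layering are all illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def heaviside(phi):
    """Indicator of the side phi >= 0 of the interface phi = 0."""
    return (phi >= 0).astype(float)

# Section coordinates: x horizontal, z vertical (arbitrary units).
x, z = np.meshgrid(np.linspace(0, 10, 101), np.linspace(0, 5, 51))

# Hypothetical implicit functions, positive "inside" each object.
horizon = 2.5 + 0.5 * np.sin(x) - z                         # below the horizon
channel = 1.0 - np.hypot((x - 5.0) / 2.0, (z - 3.0) / 1.0)  # elliptical body

inside_channel = heaviside(channel)
below_horizon = heaviside(horizon)

# Illustrative property values attached to each region.
m_channel, m_below, m_above = 3.0, 2.0, 1.0

# The channel is the youngest event, so it is tested first and overprints
# the layering; the horizon only matters outside the channel.
model = (inside_channel * m_channel
         + (1 - inside_channel) * (below_horizon * m_below
                                   + (1 - below_horizon) * m_above))
print(np.unique(model))
```

The order in which the indicators are nested encodes the relative age of the objects: swapping the roles of `inside_channel` and `below_horizon` would let the horizon cut the channel instead.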

#### **From Objects to Physical Parameters**

Generally, geological object geometry cannot be described analytically and determining the associated physical parameter fields is not straightforward. In most cases, objects are first discretized in space with a mesh that will support the numerical resolution of the physical equations (Kolditz et al. 2012; Pellerin et al. 2017). This mesh is a numerical translation of Eq. (28.1) discretizing the space in elementary volumes deemed representative of some effective physical properties (the values *mk* in Eq. (28.1)).

A possible working assumption is to consider a constant (or analytically defined) parameter value associated with each type of geological object. This principle is used for simplicity in a number of numerical models (Gjøystdal et al. 1985; Jackson et al. 2015). However, as discussed above, heterogeneity exists at many different scales and can have an impact on the physical process below the scale of the objects that are explicitly represented in a numerical model. For example, it is well known in stochastic hydrogeology and reservoir engineering that petrophysical heterogeneity exists within layers or sedimentary facies and impacts flow and transport (see for instance de Marsily et al. 2005 for a review). In many cases, the orientation of heterogeneities within a geological object depends on the object geometry (e.g., crystal orientations in a dyke may be preferentially aligned along the dyke boundaries; sedimentary heterogeneities tend to be more continuous along layers than orthogonally to layers). This can be addressed in modeling by explicitly using locally variable directions of anisotropy (Boisvert et al. 2009) or by considering a geometric transform between two spaces (Mallet 2014; Shtuka et al. 1996). This last option is very promising, as it provides a way to simplify geostatistical modeling and makes it possible to define some useful geological variables such as the apparent sedimentation rate (Kedzierski et al. 2007; Mallet 2014; Massonnat 1999). Such use of indirect geological parameters is an essential and powerful way to introduce geological principles in earth models.

Nonetheless, one should not neglect that object geometry affects model predictions at the two main stages of geostatistical models: (1) geostatistical inference (distributions of continuous variables within each subdomain, multivariate relationships between different variables, trends, spatial variability) and (2) geostatistical modeling (interpolation or simulation). The separation of integrated modeling into an object-modeling phase and a petrophysical modeling phase is, therefore, relatively easy in the classical case where objects are known, when a clear separation of scales exists between representative elementary volume (REV) properties and object geometry, and when objects do not affect geostatistical parameters. However, uncertainty about object geometry and topology can have a significant impact on statistical parameters (Lallier et al. 2016), which can be a significant source of complexity in practical studies. More generally, finding at what scale explicit object properties and REV effective properties can be separated is a fundamental problem in modeling. Therefore, more research is clearly needed to capture the interactions between object geometric (and topological) parameters and random field parameters.

#### **Object Uncertainty**

Geometric uncertainty can be sampled by adding geometric perturbations to an existing reference model (Caumon et al. 2007; Corre et al. 2000; Lecour et al. 2001) or by creating several models after perturbing data (Lindsay et al. 2013; Wellmann et al. 2010). As the very existence of some objects is also uncertain in many cases, it is also useful to consider object-based stochastic simulation. In random set theory, geometric objects are placed randomly and independently in the domain by combining the simulation of points (Poisson Point Process) and the simulation of object shapes around these points (see Chiles and Delfiner 2012 and references therein; Lantuéjoul 2002). Classically, objects are geometric primitives defined analytically, whose shape, orientation and size parameters are simulated from some input distribution. Random set theory places a lot of emphasis on the statistical aspects of this process and on conditioning to spatial data; see in particular Lantuéjoul (2002) and Allard et al. (2006). These models, in particular the Boolean Model, have been used to simulate many types of geological objects such as fractures (Chiles 1988), shale lenses (Haldorsen and Lake 1984) or sedimentary channels (Deutsch and Wang 1996; Holden et al. 1998). Extensions of the Boolean Model have also been proposed to introduce interactions between objects, such as attraction or repulsion between fractures to reproduce their mechanical interactions (Aydin and Caers 2017; Bonneau et al. 2016; Chiles 1988; Hollund et al. 2002).
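The construction of a Boolean model can be sketched as follows, assuming discs as analytic primitives, radii drawn uniformly between two hypothetical bounds, and a rasterization on a regular grid for display purposes only. The function name and parameter values are illustrative; the comparison uses the classical result that the covered proportion of a Boolean model is 1 − exp(−θ·E[A]) for Poisson intensity θ and mean object area E[A].

```python
import numpy as np

def boolean_model(intensity, r_min, r_max, size=100.0, n_cells=200, seed=0):
    """Rasterized Boolean model of random discs on a square of side `size`."""
    rng = np.random.default_rng(seed)
    # Simulate centres in an enlarged window so that discs centred just
    # outside the domain can still overlap it (avoids edge effects).
    margin = r_max
    side = size + 2 * margin
    n_objects = rng.poisson(intensity * side**2)           # Poisson count
    centres = rng.uniform(-margin, size + margin, size=(n_objects, 2))
    radii = rng.uniform(r_min, r_max, size=n_objects)      # object sizes
    axis = (np.arange(n_cells) + 0.5) * size / n_cells     # cell centres
    gx, gy = np.meshgrid(axis, axis)
    covered = np.zeros((n_cells, n_cells), dtype=bool)
    for (cx, cy), r in zip(centres, radii):
        covered |= (gx - cx) ** 2 + (gy - cy) ** 2 <= r**2  # union of discs
    return covered

intensity, r_min, r_max = 0.05, 0.5, 1.5
covered = boolean_model(intensity, r_min, r_max)

# Mean disc area for a uniform radius: pi * E[r^2].
mean_area = np.pi * (r_min**2 + r_min * r_max + r_max**2) / 3.0
expected = 1.0 - np.exp(-intensity * mean_area)
print(f"simulated proportion {covered.mean():.3f}, theoretical {expected:.3f}")
```

This unconditional sketch ignores data conditioning and object interactions, which are precisely the aspects that the extensions cited above address.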

From a random set perspective, a deterministic object model is a particular realization of some underlying random set process. In this case, the relatively large data density allows one to consider mainly the data conditioning problem rather than focusing on the number of objects and on their spatial density. Another focus of deterministic object modeling approaches relates to the expert-guided definition of interactions between objects using interactive editing tools to ensure that the connectivity between objects is compatible with the geological history of the domain (e.g., how faults branch one onto another and how faults displace horizons).

Yet, more and more complex geometric object parameterizations have recently been introduced in object-based simulation methods. For instance, several authors propose to anchor sedimentary channels on discrete polygonal curves (Mariethoz et al. 2014; Pyrcz et al. 2009; Rongier et al. 2017; Ruiu et al. 2016; Viseur 2004). Other variants consider the bounding surfaces of stratigraphic deposits together with some rules to mimic depositional processes (Graham et al. 2015; Labourdette 2008; Michael et al. 2010; Pyrcz et al. 2005, 2015; Rongier et al. 2017; Ruiu et al. 2016; Sech et al. 2009). As argued in the review of Pyrcz et al. (2015), these models make it possible to consider genetic principles such as erosion, progradation and aggradation of sedimentary deposits in an automatic way. Similarly, pseudo-process-based models have also been proposed in the area of fracture modeling to approximate mechanical interactions and truncations that occur during fracture growth (Bonneau et al. 2013; Davy et al. 2013; Srivastava et al. 2005). At a larger scale, a recent trend has been to simulate possible stochastic geometries where the number and the connectivity of faults are variable (Aydin and Caers 2017; Cherpeau et al. 2010, 2012; Cherpeau and Caumon 2015; Holden et al. 2003; Julio et al. 2015a). In all these approaches, the use of rules is often a means to generate realistic objects and to produce likely connectivities and spatial features without being constrained by some input grid resolution. However, conditioning to dense spatial data sets remains challenging with these approaches. A possible way forward is to consider parameter-rich object models and to consider process-based rules backward in time (Parquer et al. 2016; Ruiu et al. 2015). In all cases, expert control of model realism is also difficult and may call for additional "geological likelihood" functions to scrutinize the realizations (Jessell et al. 2010).

Interestingly, the use of continuous functions around the Poisson points used in object-based simulation (the Random Function Model of Jeulin 2002) is a possible way to relate random sets to Eq. (28.1). However, formalizing the link between object models and basis functions used in physical models is not easy and relies on the assumption that values are analytically defined on each object, and that objects have stationary statistics (Jeulin 2012; Oda 1986). Dealing with more realistic geometries and sequential Boolean operations to reproduce the succession of geological events calls for further numerical and mathematical developments. Meanwhile, as statistical properties of random sets are not easily checked in practical cases, the numerical approach to relate objects to physics clearly remains an area of much interest (Botella et al. 2016; Cacace and Blöcher 2015; Karimi-Fard and Durlofsky 2016; Merland et al. 2014; Mustapha 2011; Pellerin et al. 2014; Zehner et al. 2015).

### **28.5 Conclusions and Challenges**

Several complementary ways exist to incorporate geological information in earth models (Fig. 28.2): spatial statistics, geological variables, geometry and topology of geological objects and explicit geological process modeling. Links exist between the random field and object-based frameworks in cases where the canonical random field theory is applicable (e.g., homogeneous and stationary object densities). This forms the rationale for most modeling methods where "small" objects are treated through their (spatially correlated) equivalent properties at the representative elementary volume scale. "Large" objects are modeled explicitly using rules and parameters that incorporate geological principles and may be calibrated from data and analogs.

Although geostatistics has proven an invaluable theoretical framework to rigorously describe geological domains, it needs to be complemented by geological reasoning (*sensu* Frodeman 1995). Namely, considering discrete time steps approximating geological history, together with geological variables which cannot be directly measured, can significantly help generate more predictive geological models, which may not always have stationary statistical properties. Geometric and topological interactions between objects have a direct connection to geological history and prove a powerful tool to characterize geological domains.

From a physical modeling perspective, geometric object models make it possible to represent small spatial features which can have a large impact on physical processes (Jackson et al. 2015; Julio et al. 2015b; Matthäi et al. 2007). This calls for specific developments in meshing and physical simulation, for example to better account for object features directly in the numerical code (Pichot et al. 2012). In the frame of inverse problems, sensitivity analysis is essential in practical studies. Theoretically, specific methods integrating the probability of existence of objects also need to be considered more widely, such as random vector parameterization (Cherpeau et al. 2012), reversible jump Markov chain Monte Carlo simulation (Green 1995; Sambridge et al. 2012) or ensemble-based methods (Scheidt and Caers 2009). Both in forward and inverse physical models, an additional and significant challenge is to better characterize the multi-scale interactions between geometrical and petrophysical parameterizations (basis functions and associated parameter values).

**Fig. 28.2** Summary of the various complementary ways to incorporate geological knowledge in earth models

**Acknowledgements** The ideas expressed in this chapter owe much to Bruno Lévy, Albert Tarantola, Andre Journel and Jean-Laurent Mallet. Their encouragement, trust and critical remarks have been essential influences. I am also grateful to the graduate students and colleagues of the Research for Integrative Numerical Geology (RING) Team, especially my colleagues Pauline Collon and Paul Cupillard, for their multiple contributions to such a stimulating research environment. Discussions in the frame of the HIWAI ANR project led by Yann Capdeville also fed some of the ideas about scale expressed in this chapter. Last, but not least, at a time when research funding is getting more and more complex, I express my great appreciation to the academic and industrial sponsors of the RING-Gocad Consortium for their continued support and to ASGA for its effective Consortium management.




### **Chapter 29 Fifty Years of Kriging**

**Jean-Paul Chilès and Nicolas Desassis**

J.-P. Chilès (✉) · N. Desassis, Centre of Geosciences, Mines ParisTech, Fontainebleau, France. e-mail: jean-paul.chiles@mines-paristech.fr; jean-paul@chiles.name (J.-P. Chilès); nicolas.desassis@mines-paristech.fr (N. Desassis)

© The Author(s) 2018. B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6_29

**Abstract** Random function models and kriging constitute the core of the geostatistical methods created by Georges Matheron in the 1960s and further developed at the research center he created in 1968 at Ecole des Mines de Paris, Fontainebleau. Initially developed to avoid bias in the estimation of the average grade of mining panels delimited for their exploitation, kriging progressively found applications in all domains of natural resources evaluation and earth sciences, and more recently in completely new domains, for example, the design and analysis of computer experiments (DACE). While the basic theory of kriging is rather straightforward, its application to a large diversity of situations requires extensions of the random function models considered and sound solutions to practical problems. This chapter presents the origins of kriging as well as the development of its theory and its applications over the last fifty years. More details are given for methods presently in development to efficiently handle kriging in situations with a large number of data and a nonstationary behavior, notably the Gaussian Markov random field (GMRF) approximation and the stochastic partial differential equation (SPDE) approach, with a synthetic case study concerning the latter.

### **29.1 Introduction**

The creation of the IAMG is a landmark of the year 1968, which motivates the present book. Another important event of that year is the foundation of a research center of Ecole des Mines de Paris dedicated to geostatistics and mathematical morphology, two disciplines created by Georges Matheron. Concerning geostatistics, this research center was to develop the applications of kriging, invented by Matheron several years earlier. The theory of kriging seems so straightforward that it was reasonable to imagine that, after some generalizations, kriging would become a classical tool requiring no further research. On the contrary, 50 years later it remains the subject of active research, with renewed points of view. Another paradox: originating from mining estimation problems, and very close to statistical regression from a theoretical standpoint, it was not obvious that kriging would be considered in domains other than mining and earth sciences. However, applications now include, for example, the design of aircraft (Chung and Alonso 2002), the prediction of the mechanical properties of nanomaterials (Yan et al. 2012), the optimization of supply chain networks (Dixit et al. 2016), the construction of financial term-structures (Cousin et al. 2016), the modeling of social systems (Oliveira et al. 2013), and in all cases the quantification of the uncertainty.

It is therefore not surprising to see in Table 29.1 that the number of articles on kriging (word "kriging" or "cokriging" present in the title) published by the journals of the Scopus database doubles decade after decade. The situation is slightly different for the three journals published by the IAMG: *Mathematical Geosciences* (formerly *Journal of the International Association for Mathematical Geology*, then *Mathematical Geology*), *Computers & Geosciences*, and *Natural Resources Research*. Indeed, IAMG journals played a major role in the dissemination of the geostatistical literature in English in the first decades, but now have to share this role with the journals of the new application domains. (Note incidentally that few articles were published before 1980: the literature relative to kriging was largely written in French or published in monographs and conference proceedings.)

On closer inspection, the originality of kriging lies in its inclusion in the geostatistical approach, where the optimality provided by kriging rests on an analysis of the spatial variability of the phenomenon of interest. Indeed, if methods for characterizing that variability were lacking, the optimality of kriging would simply be virtual. As for the persistence of research on kriging, it is largely tied to the growth in the computing power and memory of computers, and to the increase in the volume of data. At its origin kriging considered a few samples in the vicinity of a target block, while it now has to take into account up to thousands or even millions of data (remote sensing, laser, seismic).

This chapter first presents the origins of kriging and its theory. It continues with further developments, roughly chronologically, up to current research. Kriging has a number of variants and generalizations. We focus here on linear kriging, moreover in a monovariate context. Cokriging and disjunctive kriging are therefore not considered; conversely, the use of kriging to condition geostatistical simulations is acknowledged. Our aim is not a thorough presentation of kriging, which can be found in many textbooks, for example, Chilès and Delfiner (2012).

### **29.2 The Origins of Kriging**

One of the tasks of the mining engineer is to select the panels to be exploited, and even to delimit them if the exploitation method allows this freedom. Indeed, to simplify, a panel deserves to be exploited only if the cost of its extraction and processing does not exceed the value of the metal which can be extracted from it. For given technical and economic parameters, this means that the panel grade has to exceed some cutoff grade. In practice the true grade of a panel is not known before its exploitation, so that the selection is made on the basis of an estimated grade. At the beginning of the 1950s the estimate was simply the average grade of the data belonging to the panel or situated at its border. Krige (1951, 1952), studying exploitation data of several orebodies, observed that for high cutoffs the panels selected that way were on average less rich than expected.

As Fig. 29.1 shows, this is not really surprising. Two parallel galleries in a sub-horizontal deposit present segments AB and CD with grades above the cutoff, unlike the neighboring parts of the galleries. Therefore the decision is made to exploit the trapezoid ABDC, and its grade is anticipated to be equal to the weighted average of the grades of segments AB and CD. In fact, segments AC and BD do not represent the real border between rich and poor ores. The true (unknown) limits look like the dotted lines. Therefore, poor ore is exploited (and rich ore abandoned), so that the grade of the exploited ore is lower than expected.

Mathematically, this expresses a conditional bias: denoting by *Z<sub>v</sub>* the panel grade and by *Z̄* the average grade of the cores situated within the panel, the conditional expectation *E*[*Z<sub>v</sub>* | *Z̄*] is not equal to *Z̄*.

**Fig. 29.1** Illustration of the estimation bias. The panel ABDC to be exploited was delimited from the rich samples observed along AB and CD. Because the true border between rich and poor ores follows a line similar to the dotted line rather than segments AC and BD, poor ore will be exploited and rich ore abandoned. (from Matheron 1961)

To avoid this bias, Krige gave a weight λ to the average grade of the data situated in the panel and the complementary weight 1 – λ to the average grade of the orebody, λ being determined by linear regression (Krige in fact considered the lognormal case and worked with the logarithm of the grade).
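The effect of this regression weight can be illustrated numerically. The sketch below assumes Gaussian panel grades and independent Gaussian sampling errors with hypothetical parameters, rather than the lognormal setting actually used by Krige, and it uses the true grades to fit the regression, as one could do afterwards with exploitation data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
m, s_panel, s_error = 5.0, 1.0, 1.0    # hypothetical grade statistics
cutoff = 6.0

true_grade = rng.normal(m, s_panel, n)                    # Z_v
sample_mean = true_grade + rng.normal(0.0, s_error, n)    # Z-bar

# Regression slope of the true grade on the sample average.
lam = np.cov(true_grade, sample_mean)[0, 1] / np.var(sample_mean, ddof=1)

# Weight lam on the panel data, 1 - lam on the orebody average.
corrected = lam * sample_mean + (1 - lam) * sample_mean.mean()

selected = sample_mean > cutoff        # panels chosen from the naive estimate
naive_mean = sample_mean[selected].mean()
true_mean = true_grade[selected].mean()
corrected_mean = corrected[selected].mean()
print(f"lambda = {lam:.3f}")
print(f"selected panels: naive {naive_mean:.2f}, "
      f"true {true_mean:.2f}, corrected {corrected_mean:.2f}")
```

In this synthetic setting the naive estimate overstates the grade of the selected panels, while the regression estimate agrees on average with the grade actually recovered.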

Also facing problems of mining estimation, Matheron studied Krige's work and generalized his approach by assigning a proper weight to each sample, these weights being determined so as to minimize the estimation variance under the condition that the weights sum to 1 (this condition simply expresses that the estimator is a weighted average of the data).

Matheron called this method "kriging" in honor of Danie Krige. To be accurate, according to Cressie (1990), the French term "krigeage" was coined by Pierre Carlier and first used at the French Commissariat à l'énergie atomique in the late 1950s, and Matheron translated it as "kriging" in Matheron (1963b) (the first appearance of "krigeage" found by the present authors in Matheron's work is Matheron 1960, where it is mentioned as an already known concept).

### *29.2.1 Ordinary Kriging (OK)*

Geostatistics considers natural variables distributed in space, whose behavior presents a large complexity of detail. These regionalized variables cannot be adequately represented by deterministic functions and therefore methods dedicated to random functions (RF) are considered. The theory of kriging as it is usually presented appears in Matheron (1962, 1963a). It takes place in the framework of an order-2 stationary random function (SRF) model. The regionalized variable of interest (here a grade) is considered as a realization of an SRF *Z*(*x*), where *x* denotes a point in a two- or three-dimensional space. *N* data are available, at locations *x*<sub>α</sub>, α = 1, 2, …, *N*, with values *Z*<sub>α</sub> = *Z*(*x*<sub>α</sub>). The target *Z*<sub>0</sub> is the value *Z*(*x*<sub>0</sub>) of *Z* at an unobserved point *x*<sub>0</sub>, or more generally the average value *Z*(*v*) of *Z* in a given cell or block *v*. The kriging estimator of *Z*<sub>0</sub> is by definition of the form

$$Z^* = \sum_{\alpha=1}^{N} \lambda_{\alpha} Z_{\alpha}$$

with weights λ<sub>α</sub> summing to 1. The weights are chosen so as to minimize the variance of the estimation error *Z*\* – *Z*<sub>0</sub> subject to the condition on their sum. This leads to a linear system of *N* + 1 equations with *N* + 1 unknowns (the *N* weights λ<sub>α</sub> and a Lagrange parameter μ):

$$\begin{cases} \sum_{\beta} \lambda_{\beta} \sigma_{\alpha\beta} + \mu = \sigma_{\alpha 0}, & \alpha = 1, \dots, N \\ \sum_{\beta} \lambda_{\beta} = 1 \end{cases}$$

where σ<sub>αβ</sub> denotes the covariance of the observations *Z*<sub>α</sub> and *Z*<sub>β</sub> and σ<sub>α0</sub> the covariance of *Z*<sub>α</sub> and the target *Z*<sub>0</sub>. This is the ordinary kriging system. The ordinary kriging variance can then be expressed as:

$$
\sigma_{\rm OK}^2 = \mathrm{E}(Z^* - Z_0)^2 = \sigma_{00} - \sum_{\alpha} \lambda_{\alpha} \sigma_{\alpha 0} - \mu
$$

where σ<sub>00</sub> denotes the variance of *Z*<sub>0</sub>.
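The ordinary kriging system and variance above can be assembled and solved directly. The following sketch assumes point data on a line, a point target, and an exponential covariance with hypothetical sill and scale; the function names and data values are illustrative only.

```python
import numpy as np

def exp_cov(h, sill=1.0, scale=10.0):
    """Exponential covariance model (an assumed choice for the example)."""
    return sill * np.exp(-np.abs(h) / scale)

def ordinary_kriging(x_data, z_data, x0, cov=exp_cov):
    """Solve the (N + 1) x (N + 1) OK system for a point target x0."""
    n = len(x_data)
    # Data-to-data covariances bordered by the unit-sum constraint.
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = cov(x_data[:, None] - x_data[None, :])
    A[:n, n] = 1.0
    A[n, :n] = 1.0
    # Right-hand side: data-to-target covariances, then the constraint value.
    b = np.append(cov(x_data - x0), 1.0)
    sol = np.linalg.solve(A, b)
    lam, mu = sol[:n], sol[n]
    estimate = lam @ z_data
    variance = cov(0.0) - lam @ b[:n] - mu   # sigma_00 - sum(lam*sigma_a0) - mu
    return estimate, variance, lam

x_data = np.array([0.0, 5.0, 12.0])   # hypothetical sample locations
z_data = np.array([1.2, 0.7, 1.9])    # hypothetical grades
est, var, lam = ordinary_kriging(x_data, z_data, x0=8.0)
print(f"estimate {est:.3f}, variance {var:.3f}, weights {np.round(lam, 3)}")
```

Calling the function with `x0` equal to one of the data locations returns the datum itself with zero variance, which illustrates that kriging is an exact interpolator.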

### *29.2.2 Simple Kriging (SK)*

Note that the kriging system and variance do not require the knowledge of the mean. If the mean *m* were known, we would use an estimator of the form

$$Z^\* = \sum\_{\alpha} \lambda\_{\alpha} Z\_{\alpha} + \left(1 - \sum\_{\alpha} \lambda\_{\alpha}\right) m\_{\alpha}$$

without constraint on the weights, and the minimization of the estimation variance would lead to the simple kriging system

$$\sum\_{\beta} \lambda\_{\beta} \sigma\_{\alpha \beta} = \sigma\_{\alpha 0} \quad \alpha = 1, \dots, N$$

and to the simple kriging variance

$$
\sigma\_{\rm SK}^2 = \mathrm{E}(Z^\* - Z\_0)^2 = \sigma\_{00} - \sum\_{\alpha} \lambda\_{\alpha} \sigma\_{\alpha 0}
$$

Simple kriging has few direct applications. It is important nonetheless, because it has useful properties that are not shared by ordinary kriging or, of course, universal kriging (see Chilès and Delfiner 2012, Chap. 3). From a computational point of view, because the kriging matrix is positive definite, the system can be solved by the Cholesky method.
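As a hedged illustration of the Cholesky route, the sketch below solves the simple kriging system for a point target in one dimension. It assumes NumPy, an exponential covariance, and a known mean; the names `exp_cov` and `simple_kriging` and all numerical values are hypothetical. For brevity the two triangular systems are passed to the general solver `np.linalg.solve`, which does not exploit their triangular structure.

```python
import numpy as np

def exp_cov(h, sill=1.0, scale=10.0):
    """Illustrative covariance model: C(h) = sill * exp(-|h| / scale)."""
    return sill * np.exp(-np.abs(h) / scale)

def simple_kriging(x_data, z_data, x0, mean, cov=exp_cov):
    """Simple kriging of Z(x0) with known mean, using a Cholesky factorization."""
    x_data = np.asarray(x_data, dtype=float)
    z_data = np.asarray(z_data, dtype=float)
    C = cov(x_data[:, None] - x_data[None, :])    # sigma_alpha_beta
    c0 = cov(x_data - x0)                         # sigma_alpha_0
    L = np.linalg.cholesky(C)                     # C = L L^T
    lam = np.linalg.solve(L.T, np.linalg.solve(L, c0))
    # Z* = sum(lam * Z) + (1 - sum(lam)) * m, written as m + sum(lam * (Z - m)).
    estimate = mean + lam @ (z_data - mean)
    variance = cov(0.0) - lam @ c0
    return estimate, variance, lam

x = [0.0, 3.0, 7.0, 12.0]
z = [1.2, 0.8, 1.5, 2.0]
est, var, lam = simple_kriging(x, z, x0=5.0, mean=1.4)
far_est, far_var, _ = simple_kriging(x, z, x0=1.0e4, mean=1.4)
```

Far from all data the weights vanish, so the estimate falls back to the mean and the variance to the sill; this behavior distinguishes simple kriging from ordinary kriging, whose weights always sum to 1.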

### *29.2.3 Ordinary Kriging in the IRF Model*

Because the mean *m* is not involved in ordinary kriging, it is possible to extend ordinary kriging to a more general random function model, the (order-2) intrinsic random function (IRF) model, characterized by

$$\begin{aligned} \mathbb{E}[Z(x+h) - Z(x)] &= 0\\ \frac{1}{2}\mathbb{E}[Z(x+h) - Z(x)]^2 &= \gamma(h) \end{aligned}$$

The variogram γ(*h*) summarizes the spatial variability of the random function. Geostatistics provides a set of consistent tools for choosing the variogram model adapted to a particular situation (e.g., Chilès and Delfiner 2012, Chap. 2). The above OK system and OK variance remain valid provided that *C*(*h*) is formally replaced by –γ(*h*) in the expressions of σ<sub>αβ</sub>, σ<sub>α0</sub> and σ<sub>00</sub> given in the next section. This is the framework where kriging is widely used, especially in mining applications.
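The substitution of –γ(*h*) for *C*(*h*) can be checked on a small case. The sketch below assumes NumPy and uses a linear variogram γ(*h*) = |*h*|, a model that has no covariance counterpart; the function name `ok_with_variogram` and the data locations are hypothetical. In one dimension with this variogram, ordinary kriging between two neighboring data reduces to linear interpolation, which the example reproduces.

```python
import numpy as np

def ok_with_variogram(x_data, x0, gamma):
    """Ordinary kriging weights in the intrinsic model: C is replaced by -gamma."""
    x_data = np.asarray(x_data, dtype=float)
    n = x_data.size
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = -gamma(x_data[:, None] - x_data[None, :])
    A[n, n] = 0.0
    b = np.append(-gamma(x_data - x0), 1.0)
    sol = np.linalg.solve(A, b)
    lam, mu = sol[:n], sol[n]
    # Same variance expression as before, with sigma_00 = -gamma(0) = 0.
    variance = -gamma(0.0) - lam @ b[:n] - mu
    return lam, mu, variance

def linear_variogram(h):
    """gamma(h) = |h|: a valid variogram with no associated covariance."""
    return np.abs(h)

lam, mu, var = ok_with_variogram([0.0, 3.0, 7.0, 12.0], 5.0, linear_variogram)
```

Here the target at 5 lies midway between the data at 3 and 7, so the weights are 0.5 on each of them and 0 on the outer data, and the kriging variance equals 2.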

### *29.2.4 Discussion*

Finally, kriging appears as nothing but (a straightforward generalization of) multiple linear regression on *N* data *Z*<sub>α</sub> that need not be of the form *Z*(*x*<sub>α</sub>). Does it deserve special consideration?

In fact the application of this regression requires that the covariances between the observations, and between each observation and the target, are known. They can be determined experimentally when repeated measurements are available, as is the case in meteorology, but not in usual earth sciences applications, where a unique phenomenon is considered. Applying the regression formula with a priori covariances would provide an estimator that would lose any optimality, except if by chance these covariances are perfectly suited to the data.

Kriging implies a spatial context:


The covariances σ<sub>αβ</sub> are then of the form *C*(*x*<sub>β</sub> – *x*<sub>α</sub>), and σ<sub>α0</sub> is *C*(*x*<sub>0</sub> – *x*<sub>α</sub>) if the target is *Z*(*x*<sub>0</sub>) or the average value of *C*(*x* – *x*<sub>α</sub>) when *x* spans *v* if the target is *Z*(*v*). The variance σ<sub>00</sub> of *Z*<sub>0</sub> that appears in the expression of the kriging variance is *C*(0) if the target is *Z*(*x*<sub>0</sub>) or the average value of *C*(*x*′ – *x*) when *x* and *x*′ span *v* independently if the target is *Z*(*v*).

Several authors proposed an approach similar to simple or ordinary kriging before Matheron but not in a spatial context (see Cressie 1990). The notable exception is Gandin (1963), who independently developed an approach similar to Matheron's, in meteorology. SK is called *optimal interpolation*, and OK *optimal interpolation with normalization of weighting factors*. Like Matheron, Gandin was concerned with both the theory and its applications; he is, for example, the first author to define and compute a variogram cloud.

### *29.2.5 Analytic Calculation of Average Covariances*

In the early 1960s computers were not available, at least for mining applications. It was therefore not easy to solve linear systems of equations. Even if point (or core) data could be used to determine the variogram, kriging was applied to aggregated data. In the case of Fig. 29.1, a typical situation examined by Matheron (1961), all cores along AB are represented by their average grade *Z*<sub>1</sub>, those along CD by *Z*<sub>2</sub>, and those belonging to A′A and BB′ by *Z*<sub>3</sub>. The target is the average grade *Z*<sub>0</sub> of the trapezoid ABDC. Kriging amounts to finding the best weights λ<sub>1</sub> for *Z*<sub>1</sub>, λ<sub>2</sub> for *Z*<sub>2</sub>, and λ<sub>3</sub> = 1 – λ<sub>1</sub> – λ<sub>2</sub> for *Z*<sub>3</sub>, minimizing the variance of λ<sub>1</sub>*Z*<sub>1</sub> + λ<sub>2</sub>*Z*<sub>2</sub> + (1 – λ<sub>1</sub> – λ<sub>2</sub>)*Z*<sub>3</sub> – *Z*<sub>0</sub>. This requires solving a system of two equations, which is straightforward, but the various covariances involved must first be calculated. For example, if the series of contiguous cores along AB is described by a three-dimensional elongated volume *s* and the target block (the trapezoid ABDC in projection on the horizontal plane, with some thickness in the vertical direction) by *v*, σ<sub>10</sub> represents $\frac{1}{|s||v|}\int\_s \int\_v C(x' - x)\,dx'\,dx$, which is a sextuple integral. A special variogram model, the logarithmic or de Wijsian model, was widely used because it is very tractable for analytical calculations of average covariances with Taylor expansions (see numerous technical reports of Matheron on the internet site of Mines ParisTech, Center of Geosciences, On-line geostatistical library).
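With a computer, average covariances of this kind are commonly approximated by discretizing the supports into points rather than by analytical expansion. The sketch below is such a numerical stand-in, not Matheron's analytical method; it assumes NumPy and an exponential covariance, and the helper names `segment` and `mean_covariance` are hypothetical. It is checked against the closed-form mean of exp(–|*h*|/*a*) over a segment of length *L*, which is 2(*a*/*L*)²(*L*/*a* – 1 + e<sup>–*L*/*a*</sup>).

```python
import numpy as np

def segment(start, end, n):
    """n midpoints regularly discretizing the straight segment [start, end]."""
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    t = (np.arange(n) + 0.5) / n
    return start + t[:, None] * (end - start)

def mean_covariance(points_s, points_v, cov):
    """Average of cov(|x' - x|) over all pairs (x in s, x' in v)."""
    diff = points_s[:, None, :] - points_v[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return cov(dist).mean()

scale = 10.0
cov = lambda d: np.exp(-d / scale)

# A line of cores of length 10 in 3D, discretized into 200 points.
s = segment([0.0, 0.0, 0.0], [10.0, 0.0, 0.0], 200)
numeric = mean_covariance(s, s, cov)

# Closed-form value for a segment of length L with scale a.
L, a = 10.0, scale
exact = 2.0 * (a / L) ** 2 * (L / a - 1.0 + np.exp(-L / a))
```

The same function gives the average covariance between two different supports (for instance a core line and a discretized block) by passing two different point sets.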

### **29.3 Development and Maturity: Trend, Neighborhood Selection**

With the availability of computers in the late 1960s, it was possible to solve linear systems with about 10–20 equations. Kriging was then carried out with about ten data in and around the target block. Usually a neighborhood of one or two rings or aureolae around the target was used. If necessary, some data were grouped whose situations with respect to the target were similar. At the first international geostatistical congress in Rome in 1975, Michel David claimed that he was able to krige a mining block for a few cents, a reasonable price for real-world applications (David 1976).

In mining applications the outputs were documents with grid cells representing the blocks; the block estimates and the associated kriging standard deviations were printed in the grid cells. Very soon applications emerged in other domains than mining, with a slightly different objective: cartography, more precisely contour mapping. See, for example, Huijbregts and Matheron (1971), Chauvet and Chilès (1975) in oceanography; Delfiner (1973), Chauvet et al. (1976) in meteorology; Delfiner and Delhomme (1975), Delhomme (1978) in hydrology. Moreover, the phenomena considered in these application domains usually present a trend: the sea floor is deeper when moving away from the coast line, aquifers have a general gradient, the top of petroleum reservoirs is usually dome shaped. This called for developments in two directions: kriging theory, with universal kriging to account for trends, and kriging practice, with a careful design of kriging neighborhoods.

### *29.3.1 Universal Kriging (UK)*

The assumption of a constant mean, even if unknown, soon became a limitation for the application of kriging to phenomena displaying a trend. Kriging was therefore generalized by Matheron (1969) to random functions with a polynomial drift *m*(*x*) of the form

$$m(x) = \sum\_{\ell=0}^{L} a\_{\ell} f^{\ell}(x)$$

where the *a*<sub>ℓ</sub> are unknown coefficients and the *f*<sup>ℓ</sup>(*x*) are the *L* + 1 monomials with degree up to a given degree *k* (in the one-dimensional case, *L* = *k* and *f*<sup>ℓ</sup>(*x*) = *x*<sup>ℓ</sup>). For ℓ = 0, *f*<sup>0</sup>(*x*) ≡ 1. The kriging estimator remains of the form *Z*\* = ∑<sub>α</sub> λ<sub>α</sub>*Z*<sub>α</sub> but, because the *a*<sub>ℓ</sub> are not known, unbiasedness is ensured only under the *L* + 1 constraints

$$\sum\_{\alpha} \lambda\_{\alpha} f\_{\alpha}^{\ell} = f\_0^{\ell} \quad \ell = 0, \dots, L$$

where *f*<sup>ℓ</sup><sub>α</sub> = *f*<sup>ℓ</sup>(*x*<sub>α</sub>) and *f*<sup>ℓ</sup><sub>0</sub> is *f*<sup>ℓ</sup>(*x*<sub>0</sub>) if the target is *Z*(*x*<sub>0</sub>) or the average value of *f*<sup>ℓ</sup>(*x*) when *x* spans *v* if the target is *Z*(*v*). The minimization of the estimation variance leads to a system similar to the OK system except that there are now *L* + 1 constraints instead of a single one, and as many Lagrange parameters.

The UK kriging matrix is no longer positive definite, so that the kriging system should be solved by Gaussian elimination, which is less efficient than the Cholesky method. However, UK can be expressed as simple kriging, followed by a drift correction. The second step appears as the solution of a linear system of *L* + 1 equations with *L* + 1 unknowns, whose matrix is positive definite. It is thus advantageous to exploit this result to solve the SK system and the drift correction system by the Cholesky method (an additivity property also allows the calculation of the UK variance).
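One way to read the two-step solution is as block elimination of the bordered UK system: writing **C** for the data covariance matrix, **F** for the matrix of drift functions at the data points, **c**<sub>0</sub> and **f**<sub>0</sub> for the right-hand sides, the simple kriging weights are **C**<sup>–1</sup>**c**<sub>0</sub> and the Lagrange parameters solve a system with matrix **F**<sup>T</sup>**C**<sup>–1</sup>**F**. The sketch below follows that reading, assuming NumPy, an exponential covariance, and a linear drift in one dimension; the function names and data are hypothetical, and the exact formulation used by the authors may differ.

```python
import numpy as np

def chol_solve(L, b):
    """Solve (L L^T) x = b given the lower Cholesky factor L."""
    return np.linalg.solve(L.T, np.linalg.solve(L, b))

def uk_two_step(C, F, c0, f0):
    """Universal kriging as simple kriging followed by a drift correction."""
    L = np.linalg.cholesky(C)
    lam_sk = chol_solve(L, c0)              # step 1: simple kriging weights
    W = chol_solve(L, F)                    # C^{-1} F
    M = F.T @ W                             # (L+1) x (L+1), positive definite
    mu = chol_solve(np.linalg.cholesky(M), F.T @ lam_sk - f0)
    return lam_sk - W @ mu, mu              # step 2: drift correction

def uk_direct(C, F, c0, f0):
    """Reference solution of the bordered (indefinite) UK system."""
    n, p = F.shape
    A = np.block([[C, F], [F.T, np.zeros((p, p))]])
    sol = np.linalg.solve(A, np.concatenate([c0, f0]))
    return sol[:n], sol[n:]

x = np.array([0.0, 3.0, 7.0, 12.0, 15.0])
x0 = 5.0
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 10.0)
c0 = np.exp(-np.abs(x - x0) / 10.0)
F = np.column_stack([np.ones_like(x), x])   # drift functions 1 and x
f0 = np.array([1.0, x0])

lam, mu = uk_two_step(C, F, c0, f0)
lam_direct, mu_direct = uk_direct(C, F, c0, f0)
uk_variance = 1.0 - lam @ c0 - mu @ f0      # sigma_00 = C(0) = 1 here
```

Because of the unbiasedness constraints, data that follow the drift exactly (for instance *z* = 2 + 3*x*) are reproduced exactly at the target, whatever the covariance.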

The equations of UK were already presented by Goldberger (1962) but not in a spatial context and with covariances supposed to be known, whereas Matheron proposed tools for determining the underlying variogram in the presence of a drift. These tools revealed an inference problem that was adequately solved in the framework of a more general model, presented hereafter.

### *29.3.2 Kriging in the IRF-k Model*

Like the mean for OK, the coefficients *a*<sub>ℓ</sub> are not involved in universal kriging. This made it possible to extend it to a more general random function model, the model of intrinsic random functions of order *k* (IRF-*k*), where a generalized covariance (GC) function *K*(*h*) is substituted for *C*(*h*). The RF model was first presented by Yaglom and Pinsker (1953), and the complete theory in the *n*-dimensional space by Matheron (1971, 1973). It suffices to say here that the class of GCs includes ordinary covariances and covariances of the form –γ(*h*) when *k* = 0, and increases with *k*. It includes, for example, the power covariances (–1)<sup>*p*+1</sup>|*h*|<sup>2*p*+1</sup>, 0 ≤ *p* ≤ *k*, and the "spline" covariances (–1)<sup>*p*+1</sup>|*h*|<sup>2*p*</sup> log |*h*|, *p* integer, 1 ≤ *p* ≤ *k*. The kriging system is the same as for UK, with *K* replacing *C*.

### *29.3.3 Kriging as an Interpolant*

In cartography, the objective of the applications of kriging was more precisely to draw maps with isolines derived from point kriging at the nodes of a regular grid. Nowadays it is possible to locally refine the grid to precisely track an isoline. In both cases, there is a requirement that kriging is not only the optimal linear estimator for a single point or block but also has nice interpolation properties.

According to theory, when kriging is considered as an interpolant, that is, as a function *z*\*(*x*) of the target point *x*, the kriged map inherits its regularity from the covariance or variogram model. Indeed the universal kriging estimate can be presented in its dual form

$$z^\*(x) = \sum\_{\alpha} b\_{\alpha} C(x - x\_{\alpha}) + \sum\_{\ell} c\_{\ell} f^{\ell}(x)$$

with the convention that *C* can be replaced by –γ or by the generalized covariance *K*. The coefficients *b*<sub>α</sub> and *c*<sub>ℓ</sub> are linear functions of the data. They are obtained as solutions of a system of equations similar to the UK system (same kriging matrix). If the variogram is parabolic at the origin, then *z*\*(*x*) is differentiable; if the variogram is linear at the origin (and thus with a cusp at the origin when considered as a function of vector *h*), *z*\*(*x*) is continuous with cusps at the data points. This may not be aesthetically nice from the user's point of view, because this is not primarily the purpose of kriging. Nevertheless, a smooth map can always be obtained by applying kriging with a smooth variogram or generalized covariance model. This is the way splines were used at that time, without explicit reference to geostatistics, but Matheron (1981) showed that any spline problem is equivalent to a kriging problem in the framework of the IRF-*k* model. For example, in 2D, interpolating with biharmonic splines is equivalent to kriging in the framework of an IRF-1 model with the generalized covariance |*h*|<sup>2</sup> log |*h*|. Of course if the "true" covariance model does not conform to this model, kriging loses its optimality.
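The dual form lends itself to a compact implementation: the coefficients are computed once from the data, after which the interpolant can be evaluated at any number of target points. The sketch below assumes NumPy, a one-dimensional setting, an exponential covariance, and a polynomial drift; the system solved for the coefficients (kriging matrix on the left, data values and zeros on the right) is the standard dual system, and the function name `dual_kriging` and the data are hypothetical.

```python
import numpy as np

def dual_kriging(x_data, z_data, cov, degree=1):
    """Return the interpolant z*(x) = sum_a b_a C(x - x_a) + sum_l c_l x^l (1D)."""
    x_data = np.asarray(x_data, dtype=float)
    z_data = np.asarray(z_data, dtype=float)
    n = x_data.size

    def drift(x):
        # Columns 1, x, ..., x^degree evaluated at the points x.
        return np.vander(x, degree + 1, increasing=True)

    F = drift(x_data)
    p = F.shape[1]
    # Same matrix as the UK system; the right-hand side holds the data.
    A = np.block([[cov(x_data[:, None] - x_data[None, :]), F],
                  [F.T, np.zeros((p, p))]])
    coef = np.linalg.solve(A, np.concatenate([z_data, np.zeros(p)]))
    b, c = coef[:n], coef[n:]

    def interpolant(x):
        x = np.atleast_1d(np.asarray(x, dtype=float))
        return cov(x[:, None] - x_data[None, :]) @ b + drift(x) @ c

    return interpolant

cov = lambda h: np.exp(-np.abs(h) / 10.0)
x = np.array([0.0, 3.0, 7.0, 12.0, 15.0])
z = np.array([1.2, 0.8, 1.5, 2.0, 1.7])
z_star = dual_kriging(x, z, cov)
grid_values = z_star(np.linspace(0.0, 15.0, 61))
```

Since the exponential covariance is linear at the origin, the curve obtained here is continuous with cusps at the data points, as described above.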

### *29.3.4 Neighborhood Selection*

The dual kriging approach is very efficient in terms of computer time but presents two limitations: (i) it does not provide the kriging variance, and (ii) like direct kriging, its above interpolation properties are valid when working globally, that is, all data points are taken into account (global neighborhood). Due to practical limitations in memory space and calculation time, there is a limit in the number *N* of data that can be processed (several hundreds at that time, several thousands now). Therefore, in practice kriging often continues to be used with a moving neighborhood, that is, a limited number of data points around the target point are taken into account.

Now, when kriging with a moving neighborhood, the neighborhoods of two grid nodes can differ, and this can produce spurious discontinuities, especially when an outlier is included in the neighborhood of a grid node and not in the neighborhood of the next grid node.

The neighborhood problem is also important when building conditional simulations. The classical way at that time (and even now) for continuous variables was to work in the framework of a Gaussian RF model (if necessary after suitable transformation of the data), to generate a nonconditional simulation of the Gaussian RF, and to condition that simulation on the data with a kriging step (Journel 1974). Due to their random nature, nonconditional simulations present small-scale variations. If spurious discontinuities are added by the kriging step, it is not easy to distinguish them from natural variations, which can lead to inaccurate conclusions.

Therefore, for years, much effort was devoted by software developers to neighborhood selection (e.g., Renard and Yancey 1984). Sophisticated algorithms have been devised to reach a compromise between near and far sample points. Focusing on 2D only, neighborhoods usually include all points of the first ring and then more distant points, following a strategy that attempts to sample all directions as uniformly as possible while keeping the number of points as low as possible (octant search). Typically, 16 to 32 points are retained, from at least five octants or four noncontiguous octants. For contour mapping purposes, where continuity is important, larger neighborhoods may be considered to provide more overlap. Such an algorithm may not provide satisfactory results when data originate from profiles sampled with a short interval. The neighborhood selection then includes the requirement to have data originating from several profiles. Over the years, the size of the neighborhoods increased with the improvements of computers in terms of CPU time and storage.
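A bare-bones version of the octant search can be sketched as follows. It assumes NumPy and keeps only the nearest few points in each of the eight 45° sectors around the target; the function name `octant_search` and its parameters are hypothetical, and the additional rules mentioned above (first ring, minimum number of octants, several profiles) are deliberately left out.

```python
import numpy as np

def octant_search(points, target, per_octant=2, radius=np.inf):
    """Indices of the nearest `per_octant` points in each octant around `target` (2D)."""
    points = np.asarray(points, dtype=float)
    d = points - np.asarray(target, dtype=float)
    dist = np.hypot(d[:, 0], d[:, 1])
    angle = np.arctan2(d[:, 1], d[:, 0]) % (2.0 * np.pi)
    # Octant number 0..7; the minimum guards against rounding at 2*pi.
    octant = np.minimum((angle // (np.pi / 4.0)).astype(int), 7)
    selected = []
    for k in range(8):
        idx = np.flatnonzero((octant == k) & (dist <= radius))
        nearest = idx[np.argsort(dist[idx])[:per_octant]]
        selected.extend(nearest.tolist())
    return sorted(selected)

pts = [(1.0, 0.1), (2.0, 0.2), (3.0, 0.3), (-1.0, -0.1)]
chosen = octant_search(pts, target=(0.0, 0.0), per_octant=2)
```

In this toy case three points lie in the same octant, so the farthest of them is dropped, while the single point in the opposite octant is kept.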

### *29.3.5 Maturity*

In the 1980s kriging seemed to have reached maturity. It was widely used in mining projects to build block models of orebodies, even with a large number of sample data and a very large number of blocks. In civil engineering it enabled an accurate design of the Channel tunnel on the basis of a model of the geological layers obtained by kriging from about 100 000 data, with a sound evaluation of the uncertainty of the model (Blanchin and Chilès 1993; Chilès and Delfiner 2012, Sect. 3.8). There were further developments specific to nonlinear geostatistics (disjunctive kriging, indicator kriging) and to multivariate geostatistics (factorial kriging analysis) which are not considered here.

At the same period, Sacks et al. (1989) opened a completely new domain to kriging: the design and analysis of computer experiments (DACE). The coordinates of *x* are no longer geographic but represent scalar design variables, while the variable of interest *Z* is an objective function that depends on the design variables. A computer experiment gives the value of the objective function for chosen values of the design variables. When computer experiments are costly, kriging is used to interpolate the response surface from a limited number of data (computer experiments). Applications mainly concern engineering problems, for example, the design of aircraft (Chung and Alonso 2002). They call for specific research, due to the very special space considered, the sparsity of the data, and the difficulty of inferring the covariance. See Kleijnen (2016) for a recent review.

### **29.4 Iterative Use of Kriging to Handle Inequality Data**

Up to the early 1980s, geostatistics provided direct solutions: kriging was obtained by solving a linear system of equations, and (Gaussian) simulations were built by turning bands or other methods directly transforming a vector of independent standard normal random variables into a vector representing a discrete view of the random function. Iterative algorithms appeared to handle inequality data and more specifically to generate conditional simulations of truncated Gaussian RFs.

Inequality data were already considered in the 1980s, notably by Dubrule and Kostov (1986) and Kostov and Dubrule (1986), with a solution based on quadratic programming where inequality data are treated as constraints placed on the kriging estimate. At the end, the inequalities are classified either as inactive (they can be forgotten) or active, and in the latter case they are replaced by an equality to the upper or lower bound of the inequality. This classification is not trivial at all and is the value of the method, but the clamping effect produced by the replacement of some inequalities by their lower or upper bound is not really satisfactory.

An alternative approach proposed by Langlais (1990) is to regard inequalities as data and replace them by exact values. The procedure is to (i) simulate exact data satisfying the given inequalities while honoring the exact data and the spatial structure, (ii) average the results over several simulations, thus generating data that will replace the inequality data, and (iii) proceed to kriging from both actual and generated data.

At the same period, truncated Gaussian RFs were considered to represent geological facies. In its simplest form, such an RF is defined by a Gaussian SRF *Y*(*x*) and a threshold *s*. The truncated Gaussian RF is simply the indicator 1<sub>*Y*(*x*) ≥ *s*</sub>. The applications account for a threshold that varies with *x* (an ordinary function of *x*). More general models are obtained with several thresholds and possibly two or three Gaussian SRFs (plurigaussian RF). Matheron et al. (1987) proposed a method to build conditional simulations of truncated Gaussian RFs in the case of a separable exponential covariance. The method is rather simple because it fully exploits the Markov properties of that covariance model.

From that time the geostatistics community devoted a growing interest to Markov chain Monte Carlo (MCMC) methods (e.g., Tjelmeland and Holden 1993), and particularly to the Gibbs sampler (Geman and Geman 1984). Initially developed to solve optimization problems, these methods also provide useful algorithms for generating simulations of RFs at a finite number of sites (e.g., grid nodes). The Gibbs sampler gives a consistent iterative method to achieve the first step of Langlais (1990), which is the critical one: simulate exact data satisfying the inequalities. Let us consider that the inequality data are of the form *Z*<sub>α</sub> ∈ *B*<sub>α</sub> for some values of α, where *B*<sub>α</sub> denotes an interval. The procedure is initialized by assigning to each of these *Z*<sub>α</sub>, separately, a value *z*<sub>α</sub> chosen in the interval *B*<sub>α</sub>. Then the following sequence is repeated:


The procedure changes the simulated values at the inequality sites so that they progressively honor the spatial structure given by the covariance. This approach finds its theoretical justification in the ideal case of a Gaussian SRF with a known mean, where the conditional distribution of *Z*<sub>α</sub> is Gaussian with mean and variance equal to the kriging estimate and the kriging variance. It is however robust and is used even in the case of an unknown mean. The same approach is used effectively to generate conditional simulations constrained by inequality data, and especially truncated Gaussian RFs (the 0 or 1 data are transformed into inequality data of the form *Y*(*x*<sub>α</sub>) < *s* or *Y*(*x*<sub>α</sub>) ≥ *s*). The algorithm should be used in global neighborhood; otherwise, care should be given to the neighborhood selection, because the algorithm may diverge.
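The sketch below gives one plausible reading of such a Gibbs iteration in the ideal case described above (zero-mean Gaussian SRF, global neighborhood): each inequality site is visited in turn, its value is kriged from all other current values, and a new value is drawn from the Gaussian distribution with the kriging mean and variance, restricted to the interval *B*<sub>α</sub>. This is an assumption about the sequence of steps, not a transcription of the chapter. It assumes NumPy and the standard-library `statistics.NormalDist`; the names `truncated_normal` and `gibbs_inequalities`, the covariance, and the data are hypothetical. The kriging mean and variance are read from the inverse covariance matrix, using the identity λ<sub>*j*</sub>(*i*) = –*B*<sub>*ij*</sub>/*B*<sub>*ii*</sub> given in Sect. 29.6.4.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution

def truncated_normal(rng, mean, std, lo, hi):
    """Draw from N(mean, std^2) restricted to [lo, hi] by inverting the CDF."""
    p_lo = _STD.cdf((lo - mean) / std)
    p_hi = _STD.cdf((hi - mean) / std)
    u = rng.uniform(p_lo, p_hi)
    u = min(max(u, 1e-12), 1.0 - 1e-12)   # keep inv_cdf inside its open domain
    return float(np.clip(mean + std * _STD.inv_cdf(u), lo, hi))

def gibbs_inequalities(C, z_init, bounds, n_sweeps=200, seed=0):
    """Gibbs sampler at inequality sites; `bounds` maps site index -> (lo, hi).

    Sites absent from `bounds` carry exact data and are never modified.
    """
    rng = np.random.default_rng(seed)
    B = np.linalg.inv(C)                  # inverse covariance (zero-mean model)
    z = np.array(z_init, dtype=float)
    for _ in range(n_sweeps):
        for i, (lo, hi) in bounds.items():
            others = np.arange(z.size) != i
            # Simple kriging of Z_i from all other values, and its variance.
            mean = -(B[i, others] @ z[others]) / B[i, i]
            std = np.sqrt(1.0 / B[i, i])
            z[i] = truncated_normal(rng, mean, std, lo, hi)
    return z

x = np.array([0.0, 2.0, 4.0, 6.0, 8.0])
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 5.0)
z_init = [0.3, 1.0, -0.1, -1.0, 0.4]      # sites 1 and 3 start inside their intervals
bounds = {1: (0.5, np.inf), 3: (-np.inf, -0.2)}
z_sim = gibbs_inequalities(C, z_init, bounds)
```

Averaging the values obtained at the inequality sites over many such runs would give the replacement data of step (ii) in the procedure of Langlais (1990).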

### **29.5 Nonstationary Covariance**

Up to now we have considered models with a stationary covariance. But reality does not care about our theoretical models. If a stationary covariance is often a reasonable assumption when a limited number of samples is available, large data sets usually show some lateral variations in the covariance or variogram, so that a global model with a stationary covariance would be too crude an approximation. This problem is obviously not new. A simple solution is to split the study domain into several subdomains, to determine a specific variogram in each subdomain, and to krige each subdomain with its own variogram. To avoid discontinuities at subdomain boundaries, the variogram parameters evolve progressively from one model to the next in a transition area. This ad hoc method was used, for example, for the study of the Channel tunnel where the 100 000 data clearly showed structural variations along the 60 km of the tunnel project. Machuca-Mory and Deutsch (2013) generalize and systematize this approach.

Global nonstationary covariance models are of course sounder than the previous approach from a theoretical point of view, and also from a practical one if they can adapt to actual situations. A simple global covariance model can be derived by generalization of the covariogram model, defined by autoconvolution of an integrable and square integrable function *w*(*u*):

$$g(h) = \int w(u)\, w(u + h)\, du$$

If we replace *w*(*u*) by a dilution or kernel function *w*(*x*; *u*) also depending on *x*, integrable and square integrable in *u* whatever *x*, and define

$$g(x, x') = \int w(x; u)\, w(x'; u)\, du$$

then *g*(*x*, *x*′) is a nonstationary covariance model (e.g., Higdon et al. 1999). A random function with that covariance can be obtained by the dilution method (Higdon 2002).

Let us now examine the case where *w*, considered as a function of *u* for fixed *x*, is a Gaussian kernel with variance–covariance matrix Σ<sub>*x*</sub>. The resulting correlation function can be written (e.g., Paciorek and Schervish 2006)

$$g(x, x') = |\Sigma\_{x}|^{1/4} |\Sigma\_{x'}|^{1/4} \left| \frac{\Sigma\_{x} + \Sigma\_{x'}}{2} \right|^{-1/2} \exp(-Q\_{xx'})$$

with quadratic form

$$Q\_{xx'} = (x'-x)^{\mathrm{T}} \left(\frac{\Sigma\_{x} + \Sigma\_{x'}}{2}\right)^{-1} (x'-x)$$

If Σ<sub>*x*</sub> is constant with respect to *x*, then *g*(*x*, *x*′) is the standard Gaussian correlation function with global anisotropy matrix Σ<sub>*x*</sub>. Otherwise, if Σ<sub>*x*</sub> varies slowly, *g* is approximately stationary in a small neighborhood of *x*. This locally stationary correlation function can be generalized by replacing exp(−*Q*<sub>*xx*′</sub>) by ρ(−*Q*<sub>*xx*′</sub>) where ρ is a stationary correlation function that is valid in every dimension. This class of nonstationary covariance functions can be fitted by using local variograms whose parameters are used to build local Σ<sub>*x*</sub> matrices (e.g., Fouedjio et al. 2016). Emery and Arroyo (2018) describe a spectral algorithm for simulating such models.
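The Gaussian-kernel correlation function above is simple to evaluate. The sketch below assumes NumPy and a two-dimensional space; the function `nonstationary_corr` follows the formula with its quadratic form, while `local_sigma` is a purely hypothetical, slowly varying isotropic kernel matrix introduced only for illustration.

```python
import numpy as np

def nonstationary_corr(x1, x2, sigma1, sigma2):
    """g(x, x') for Gaussian kernels with local matrices sigma1 and sigma2."""
    sigma1 = np.asarray(sigma1, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    d = np.asarray(x2, dtype=float) - np.asarray(x1, dtype=float)
    avg = 0.5 * (sigma1 + sigma2)
    q = d @ np.linalg.solve(avg, d)                     # quadratic form Q_xx'
    prefactor = (np.linalg.det(sigma1) * np.linalg.det(sigma2)) ** 0.25
    prefactor /= np.sqrt(np.linalg.det(avg))
    return prefactor * np.exp(-q)

def local_sigma(x):
    """Hypothetical kernel matrix whose scale grows with the first coordinate."""
    a = 1.0 + 0.2 * x[0]
    return a ** 2 * np.eye(2)

rng = np.random.default_rng(0)
pts = rng.uniform(0.0, 5.0, size=(12, 2))
G = np.array([[nonstationary_corr(p, q, local_sigma(p), local_sigma(q))
               for q in pts] for p in pts])

# Constant kernel matrix 4*I and separation (2, 0): Q = 1, so g = exp(-1).
stationary_value = nonstationary_corr([0.0, 0.0], [2.0, 0.0],
                                      4.0 * np.eye(2), 4.0 * np.eye(2))
```

The matrix `G` has a unit diagonal and is symmetric and positive semidefinite, as expected of a correlation matrix, even though the local range differs from point to point.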

### **29.6 Kriging for Large Data Sets**

We have seen that kriging with moving neighborhoods produces artifacts whose amplitude can be limited by a careful design of the neighborhood selection but not eliminated. This problem is important when putting the Gibbs algorithm into practice because the procedure might diverge. The best way to avoid artifacts is to krige in global neighborhood, that is, any target point is kriged from all the data. As the capabilities of computers in terms of memory and computational performance always increase, this becomes possible for larger and larger data sets. However, the size of most data sets is also increasing with the advent of automatic measurement stations, so that the problem remains. Solving the kriging system directly by Gaussian elimination or the Cholesky method is possible for up to several thousand equations. Several attempts were made to process larger systems. Before presenting two truly global approaches, let us start with a method deriving from moving neighborhoods.

### *29.6.1 Continuous Moving Neighborhood*

Gribov and Krivoruchko (2004) developed an original method to ensure continuity with moving neighborhoods. The idea is to modify the kriging system so that data beyond a specified distance from the estimated point receive weights gradually approaching zero. This way, no discontinuity occurs when data points enter or exit the kriging neighborhood.

Rivoirard and Romary (2011) propose an equivalent approach from a different perspective: the idea is to introduce a penalty on the kriging weights in the objective function to be minimized. This penalty acts as a noise variance except that it varies with the target point *x*<sub>0</sub>. It is typically equal to 0 for data points *x*<sub>α</sub> within a distance *r* of the estimated point *x*<sub>0</sub> (no penalty applied near the target point), and increases continuously to infinity as *x*<sub>α</sub> approaches the outer boundary of the kriging neighborhood, located at a distance *R*. Data points at a distance larger than *R* thus receive a zero weight. Because this method is solely based on the addition of a noise that increases with distance, it works for all versions of kriging algorithms: OK, UK, and even IRF-*k*. Because it is local, this method can handle lateral changes in the covariance parameters.
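A minimal sketch of the penalty idea, for ordinary kriging in one dimension, is given below. It assumes NumPy and an exponential covariance. The penalty function used here, ((*d* – *r*)/(*R* – *d*))² between *r* and *R*, is a hypothetical choice that merely satisfies the stated requirements (zero up to *r*, continuous, infinite at *R*); it is not claimed to be the function used by Rivoirard and Romary, and the names `penalty` and `ok_continuous_neighborhood` are illustrative.

```python
import numpy as np

def penalty(d, r, R):
    """Hypothetical penalty: 0 for d <= r, growing continuously to infinity as d -> R."""
    d = np.asarray(d, dtype=float)
    out = np.zeros_like(d)
    ring = (d > r) & (d < R)
    out[ring] = ((d[ring] - r) / (R - d[ring])) ** 2
    return out

def ok_continuous_neighborhood(x_data, x0, cov, r, R):
    """OK weights with a distance-dependent noise variance added on the diagonal."""
    x_data = np.asarray(x_data, dtype=float)
    dist = np.abs(x_data - x0)
    inside = np.flatnonzero(dist < R)          # data beyond R get zero weight
    xs = x_data[inside]
    n = xs.size
    A = np.ones((n + 1, n + 1))
    A[:n, :n] = cov(xs[:, None] - xs[None, :]) + np.diag(penalty(dist[inside], r, R))
    A[n, n] = 0.0
    b = np.append(cov(xs - x0), 1.0)
    weights = np.zeros(x_data.size)
    weights[inside] = np.linalg.solve(A, b)[:n]
    return weights

cov = lambda h: np.exp(-np.abs(h) / 10.0)
x_data = [0.0, 2.0, 4.0, 10.99, 15.0]
w = ok_continuous_neighborhood(x_data, x0=1.0, cov=cov, r=5.0, R=10.0)
```

The datum at distance 9.99, just inside the neighborhood boundary, receives an almost zero weight, so the weights vary continuously when it leaves the neighborhood as the target moves.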

### *29.6.2 Covariance Tapering*

Large systems can be solved if the kriging matrix is sparse. This can be achieved by tapering the covariance function to zero beyond a certain range. Furrer et al. (2006), who proposed this approach, define the tapered covariance as the product of the true covariance *C* by a taper covariance *K* that has a finite range. To preserve the behavior of the true covariance *C* near the origin, which controls the lateral continuity of the interpolant, the taper covariance *K* should be more regular near the origin than *C*. The authors apply the method with about 6 000 data.
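The effect of tapering on sparsity can be illustrated as follows. The sketch assumes NumPy, data on a regular one-dimensional grid, an exponential covariance as the "true" model, and a Wendland-type taper (1 – *t*)<sup>4</sup>(1 + 4*t*) with *t* = |*h*|/range, which is parabolic at the origin and vanishes beyond the taper range; the function name `wendland1` and the parameter values are illustrative choices. A dense array is used only to keep the example short; in practice the tapered matrix would be stored and factorized with sparse-matrix routines.

```python
import numpy as np

def wendland1(h, taper_range):
    """Compactly supported taper: (1 - t)^4 (1 + 4t) for t = |h|/range < 1, else 0."""
    t = np.abs(h) / taper_range
    return np.where(t < 1.0, (1.0 - t) ** 4 * (1.0 + 4.0 * t), 0.0)

x = np.arange(200, dtype=float)           # 200 data on a regular 1D grid
h = x[:, None] - x[None, :]
C = np.exp(-np.abs(h) / 10.0)             # "true" covariance (illustrative)
C_tap = C * wendland1(h, taper_range=15.0)

density = np.count_nonzero(C_tap) / C_tap.size
L = np.linalg.cholesky(C_tap)             # succeeds: the product is still a covariance
```

About 14% of the entries remain nonzero here, and the proportion decreases further as the domain grows relative to the taper range.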

### *29.6.3 Fixed Rank Kriging*

In order to reduce the complexity of the kriging system when the number of data is very large, Cressie and Johannesson (2008) represent *Z*(*x*) as a linear combination of *r* given basis functions *S*<sub>*k*</sub>(*x*) with random coefficients η<sub>*k*</sub>, plus a white noise ε(*x*) (for simplicity, we omit the covariates considered by the authors as external drift functions):

$$Z(x) = \sum\_{k=1}^{r} \eta\_k S\_k(x) + \varepsilon(x)$$

The basis functions need not be orthogonal. They are usually chosen so as to represent several scales of variation and, for each scale, to cover the whole study domain. A typical choice is wavelet functions.

Denoting by **S**(*x*) the vector of the basis functions *S*<sub>*k*</sub>(*x*), by **K** the variance–covariance matrix of the η<sub>*k*</sub>, and assuming that the white-noise variance is constant and equal to σ<sup>2</sup>, the covariance of *Z*(*x*) and *Z*(*x*′) is

$$C(x, x') = \mathbf{S}(x)^\mathsf{T} \mathbf{K} \mathbf{S}(x') + \sigma^2 \, \delta(x' - x)$$

where δ is the Kronecker function.

Given a vector **Z** of *N* data *Z*(*x*<sub>α</sub>), the kriging matrix is

$$\boldsymbol{\Sigma} = \mathbf{S} \mathbf{K} \mathbf{S}^{\mathsf{T}} + \sigma^2 \mathbf{I}$$

where **S** is the *N* × *r* matrix whose (α, *k*) element is *S*<sub>*k*</sub>(*x*<sub>α</sub>). The authors show that the inverse of **Σ** (an *N* × *N* positive-definite matrix) in fact only requires the inversion of **K** and **K**<sup>–1</sup> + **S**<sup>T</sup>**S**/σ<sup>2</sup> (two *r* × *r* positive-definite matrices). They also show that the inference of the positive-definite matrix **K** and the variance σ<sup>2</sup> can be done with the classical geostatistical approach. Therefore, kriging becomes tractable even with a very large number of data. In an application to ozone satellite data, the authors use 396 basis functions, a huge reduction in comparison with the 173 000 data.
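The reduction to *r* × *r* inversions is an instance of the Sherman–Morrison–Woodbury identity, **Σ**<sup>–1</sup> = σ<sup>–2</sup>**I** – σ<sup>–4</sup>**S**(**K**<sup>–1</sup> + **S**<sup>T</sup>**S**/σ<sup>2</sup>)<sup>–1</sup>**S**<sup>T</sup>. The sketch below checks it numerically, assuming NumPy; the Gaussian-bump basis functions at two scales, the random matrix **K**, and the function name `frk_solve` are hypothetical choices made only for the illustration.

```python
import numpy as np

def frk_solve(S, K, noise_var, rhs):
    """Return Sigma^{-1} rhs for Sigma = S K S^T + noise_var * I, using r x r solves only."""
    small = np.linalg.inv(K) + S.T @ S / noise_var       # r x r matrix
    correction = S @ np.linalg.solve(small, S.T @ rhs)
    return rhs / noise_var - correction / noise_var ** 2

rng = np.random.default_rng(1)
x = np.linspace(0.0, 100.0, 500)                         # N = 500 data locations
centers = np.concatenate([np.linspace(0.0, 100.0, 4), np.linspace(0.0, 100.0, 8)])
widths = np.concatenate([np.full(4, 30.0), np.full(8, 12.0)])
S = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / widths[None, :]) ** 2)

A = rng.normal(size=(12, 12))
K = A @ A.T + np.eye(12)                                 # r = 12, positive definite
noise_var = 0.1
z = rng.normal(size=500)

fast = frk_solve(S, K, noise_var, z)
direct = np.linalg.solve(S @ K @ S.T + noise_var * np.eye(500), z)
```

The cost of `frk_solve` grows linearly with the number of data for a fixed number of basis functions, whereas the direct solve grows with its cube.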

### *29.6.4 Gaussian Markov Random Field Approximation*

The approach of Gaussian Markov random fields may be seen as the opposite of that of covariance tapering in the sense that it seeks to make the inverse of the covariance matrix (and not the covariance matrix itself) sparse. It was first used to generate simulations (Besag 1974, 1975) but offers a new approach to kriging (Rue and Held 2005). Let us consider a Gaussian random vector **Z** = {*Z*<sub>*i*</sub>: *i* = 1, …, *N*} with known mean **m** and variance–covariance matrix **C**. The conditional distribution of *Z*<sub>*i*</sub> given the other components {*Z*<sub>*j*</sub>: *j* ≠ *i*} is Gaussian, with mean the kriging estimate *Z*\*<sub>−*i*</sub> of *Z*<sub>*i*</sub> (the minus sign recalls that *Z*<sub>*i*</sub> is excluded from the data used for that kriging) and variance the associated kriging variance σ<sup>2</sup><sub>K*i*</sub>. Denoting by **B** the inverse of **C**, the kriging weights are found to be equal to λ<sub>*j*</sub>(*i*) = −*B*<sub>*ij*</sub>/*B*<sub>*ii*</sub>, so that we have

$$Z\_{-i}^{\*} = m\_i - \frac{1}{B\_{ii}} \sum\_{j \neq i} B\_{ij} \left(Z\_j - m\_j\right) \qquad \sigma\_{\rm Ki}^2 = \frac{1}{B\_{ii}}$$

Since *B*<sub>*ii*</sub> is the inverse of the conditional variance of *Z*<sub>*i*</sub> given {*Z*<sub>*j*</sub>: *j* ≠ *i*} (all except the *i*-th), **B** is known as the precision matrix. Its off-diagonal elements are related to the conditional correlations of *Z*<sub>*i*</sub> and *Z*<sub>*j*</sub> given {*Z*<sub>*k*</sub>: *k* ≠ *i*, *j*} by

$$\operatorname{Corr}(Z\_i, Z\_j | \{Z\_k \colon k \neq i, j\}) = -\frac{B\_{ij}}{\sqrt{B\_{ii} B\_{jj}}}$$

**B** is a symmetric positive-definite matrix. The pattern of zeroes of **B** can be used to define an undirected graph structure in which two nodes are connected by an edge when *B*<sub>*ij*</sub> ≠ 0. Let ne(*i*) denote the neighborhood of node *i*, that is, the set of nodes connected to *i* by an edge. The vector **Z** has the Markov property that *Z*<sub>*i*</sub> is conditionally independent of {*Z*<sub>*k*</sub>: *k* ∉ ne(*i*)} given {*Z*<sub>*j*</sub>: *j* ∈ ne(*i*)}. The discretely indexed Gaussian **Z** is called a Gaussian Markov random field (GMRF).

If the *N* components *Z*<sub>*i*</sub> are split into *N*<sub>1</sub> unknown components to be estimated and *N*<sub>2</sub> = *N* – *N*<sub>1</sub> data, it can be shown that kriging can be achieved by solving a linear system of *N*<sub>1</sub> variables and *N*<sub>1</sub> equations whose system matrix is that part of the precision matrix **B** corresponding to the *N*<sub>1</sub> unknown components. The GMRF approach is used when this matrix is sparse, so that the system can be solved even when *N*<sub>1</sub> is large.
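Written out, the system in question is **B**<sub>UU</sub>(**z**<sub>U</sub> – **m**<sub>U</sub>) = –**B**<sub>UD</sub>(**z**<sub>D</sub> – **m**<sub>D</sub>), where U and D index the unknown components and the data. The sketch below, assuming NumPy, solves it for a small tridiagonal (hence sparse) precision matrix and compares the result with simple kriging computed from the covariance matrix **C** = **B**<sup>–1</sup>; the matrix, the data, and the function name `gmrf_krige` are hypothetical, and dense arrays are used only for brevity.

```python
import numpy as np

def gmrf_krige(B, mean, z_data, data_idx):
    """Kriging of the unobserved components from the precision matrix B."""
    n = B.shape[0]
    unknown_idx = np.setdiff1d(np.arange(n), data_idx)
    B_uu = B[np.ix_(unknown_idx, unknown_idx)]
    B_ud = B[np.ix_(unknown_idx, data_idx)]
    rhs = -B_ud @ (z_data - mean[data_idx])
    return unknown_idx, mean[unknown_idx] + np.linalg.solve(B_uu, rhs)

n = 8
# Tridiagonal precision matrix: each site has only its two neighbors in ne(i).
B = 2.5 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
mean = np.full(n, 1.0)
data_idx = np.array([0, 3, 7])
z_data = np.array([1.5, 0.2, 1.1])

unknown_idx, estimate = gmrf_krige(B, mean, z_data, data_idx)

# Reference: simple kriging from the (dense) covariance matrix C = B^{-1}.
C = np.linalg.inv(B)
C_ud = C[np.ix_(unknown_idx, data_idx)]
C_dd = C[np.ix_(data_idx, data_idx)]
reference = mean[unknown_idx] + C_ud @ np.linalg.solve(C_dd, z_data - mean[data_idx])
```

Both routes give the same estimates, but the precision-based system keeps the sparsity of **B**, whereas **C** is dense.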

### *29.6.5 The Stochastic Partial Differential Equation (SPDE) Approach*

Although the GMRF approach seems particularly appealing for large data sets, its use remained limited because the link with the geostatistical models based on covariance functions was not clear, which made it difficult to parameterize the precision matrix. Nevertheless, some empirical studies showed that the commonly used covariance functions could be approximated quite closely by GMRFs (e.g., Rue and Tjelmeland 2002; Hrafnkelsson and Cressie 2003). These results spurred some authors to model the data by a Gaussian field characterized by its covariance and then to find a discretized GMRF for which the inverse of the associated precision matrix **B** provides a good approximation of the covariance matrix of the Gaussian field (Song et al. 2008; Cressie and Verzelen 2008). Although promising, these algorithms suffer from a lack of theoretical foundations, which makes their application difficult.

In their seminal paper, Lindgren et al. (2011) propose a formal link between Gaussian fields and GMRFs. They use a result established by Whittle in the 1950s linking some Gaussian fields and the solutions of a class of SPDEs. More precisely, let us consider the Matérn covariance function

$$C(h) = \frac{\sigma^2}{2^{\nu - 1}\Gamma(\nu)} \left(\frac{|h|}{a}\right)^{\nu} K\_{\nu}\left(\frac{|h|}{a}\right).$$

where σ<sup>2</sup> is the sill parameter, *a* > 0 is the scale parameter, ν > 0 is a regularity parameter which determines the mean-square differentiability of the Gaussian field, and *K*<sub>ν</sub> is the modified Bessel function of the second kind of order ν. The result of Whittle (1954) states that a Gaussian field *Z* with Matérn covariance function *C* is a solution of the linear fractional SPDE

$$(\kappa^2 - \Delta)^{\alpha/2} Z(s) = \tau \, W(s), \quad s \in \mathbb{R}^d$$

where α = ν + *d*/2, κ = 1/*a*, τ<sup>2</sup> = Γ(ν + *d*/2)(4π)<sup>*d*/2</sup>κ<sup>2ν</sup>/Γ(ν), Δ is the Laplacian operator, and *W* is a Gaussian white noise with unit variance. The pseudo-differential operator (κ<sup>2</sup> − Δ)<sup>α/2</sup> can be defined through its Fourier transform, but it is simply a linear combination of iterated Laplacians when α/2 is an integer.
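For half-integer values of ν, the Bessel function in the Matérn covariance above reduces to an exponential times a polynomial, which makes the model easy to evaluate without special functions. The following sketch uses these standard closed forms with the parameterization of the text (argument |*h*|/*a*); the function names are hypothetical and only ν = 1/2, 3/2 and 5/2 are covered.

```python
import math


def matern_half_integer(h, sigma2=1.0, a=1.0, nu=0.5):
    """Matern covariance C(h) for nu in {0.5, 1.5, 2.5}, argument |h|/a."""
    x = abs(h) / a
    if nu == 0.5:
        poly = 1.0                      # exponential model
    elif nu == 1.5:
        poly = 1.0 + x
    elif nu == 2.5:
        poly = 1.0 + x + x * x / 3.0
    else:
        raise ValueError("only nu = 0.5, 1.5 or 2.5 is handled in this sketch")
    return sigma2 * poly * math.exp(-x)


def spde_exponent(nu, d):
    """Exponent alpha = nu + d/2 of the operator (kappa^2 - Laplacian)^(alpha/2)."""
    return nu + d / 2.0


lags = [0.0, 0.5, 1.0, 2.0, 4.0]
cov_values = [matern_half_integer(h, sigma2=2.0, a=1.5, nu=1.5) for h in lags]
```

At *h* = 0 each closed form returns the sill σ<sup>2</sup>, and the values decrease monotonically with the lag, as expected for a Matérn model.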

Then, by using a numerical method to solve the PDE, for example a finite difference method (FDM) or a finite element method (FEM), Lindgren et al. (2011) show that the resulting discretized field at the mesh points (which can include the data locations) is a discrete GMRF. The precision matrix is directly provided by the FDM or FEM implementation. It is a sparse matrix, although the number of non-zero elements increases with ν. Therefore, by including the target points in the mesh generation, one can perform kriging with very large data sets by using an efficient solver for sparse matrices. Note that, when α is not an integer, the operator (κ<sup>2</sup> − Δ)<sup>α/2</sup> has to be approximated by (∑<sub>*i*=0</sub><sup>*p*</sup> λ*i*Δ<sup>*i*</sup>)<sup>1/2</sup>, where *p* is the smallest integer greater than α. This operator can also be discretized by an FDM or FEM.

Anisotropies can be handled with the operator (κ<sup>2</sup> − div(**H**∇))<sup>α/2</sup>, where **H** is a symmetric positive-definite matrix linked to the anisotropy matrix and div is the divergence operator.

An interesting feature of the SPDE approach is that it makes it easy to incorporate varying coefficients. For instance, the matrix **H** can be replaced by **H**(*s*) to handle a varying anisotropy (see Fuglstad et al. 2015).

Figure 29.2 presents a synthetic vertical section that could represent a variable of interest such as porosity in a sedimentary layer. The base and top of the layer were obtained by standard geostatistical simulations. The variable of interest was built according to the model of Fuglstad et al. (2015) with α = 3/2, the matrix **H** incorporating the anisotropy model depicted in Fig. 29.3. This anisotropy model was deduced from the model of the base and top of the layer, with a constant range along the local direction of the layer and a shorter range, varying proportionally to layer thickness, in the orthogonal direction. Figure 29.2 shows five vertical "drill-holes" considered as the data set, and Fig. 29.4 shows the kriged section obtained with the SPDE method. The latter shows the capability of this approach to account for the anisotropy model even in areas where there are no data (provided of course that information is available concerning the anisotropy). From a computational point of view, the method is extremely efficient: in 2D, a data set of about 100 000 data can be processed in about 10 s on a standard computer, and a number of conditional simulations can possibly be produced in nearly the same time.
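The sparsity of the precision matrix can be illustrated with a deliberately simple finite-difference sketch. It is not the construction of Lindgren et al. (2011): it assumes a 1D regular grid with spacing *h*, the case α = 2, a truncated (Dirichlet-like) boundary, and white noise discretized with variance 1/*h* per node, so that the precision matrix is (*h*/τ<sup>2</sup>) **L**<sup>T</sup>**L** with **L** the discretized operator κ<sup>2</sup> − Δ. Function names are hypothetical.

```python
def fd_operator(n, kappa, h):
    """Tridiagonal finite-difference matrix of (kappa^2 - d^2/ds^2) on n nodes."""
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = kappa ** 2 + 2.0 / h ** 2
        if i > 0:
            L[i][i - 1] = -1.0 / h ** 2
        if i < n - 1:
            L[i][i + 1] = -1.0 / h ** 2
    return L


def precision_alpha2(n, kappa, tau, h):
    """Precision matrix (h / tau^2) * L^T L for alpha = 2 (assumed scaling)."""
    L = fd_operator(n, kappa, h)
    scale = h / tau ** 2
    return [[scale * sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bandwidth(Q):
    """Largest |i - j| such that Q_ij is non-zero."""
    n = len(Q)
    return max(abs(i - j) for i in range(n) for j in range(n) if Q[i][j] != 0.0)


Q = precision_alpha2(n=8, kappa=0.5, tau=1.0, h=1.0)
band = bandwidth(Q)
```

With α = 2 each interior row of the precision matrix has five non-zero entries; applying the operator once more (larger α, hence larger ν) would widen the band, which is the sense in which the number of non-zero elements increases with ν.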

**Fig. 29.2** SPDE synthetic case study: "Reality" (in fact a simulation) and sampling of five drill-holes

**Fig. 29.3** SPDE synthetic case study: Anisotropy model

**Fig. 29.4** SPDE synthetic case study: Kriging from the data of the five drill holes

### **29.7 Iterative Algorithms for Solving the Kriging System**

Before concluding, it is worth recalling two iterative kriging algorithms presented by Jean-François Royer in 1974, that is, in the early days of geostatistics. In meteorology, at that time, two main approaches were used to carry out the "objective analysis", that is, the interpolation of temperature and pressure at the nodes of a grid from the observations at time *t*, then used as input for a numerical weather forecast at time *t* + 1. One is Gandin's approach (1963), similar to simple kriging (in meteorology, the mean can be considered known thanks to a long sequence of observations). The other is an iterative approach, the method of successive corrections proposed by Cressman (1959).

Royer (1975) considers the simple monovariate situation. In present notation, let us consider a vector **z** with *N* = *N*<sub>G</sub> + *N*<sub>S</sub> components *zi*, the first *N*<sub>G</sub> components corresponding to grid nodes (*i* ∈ *G* = {1; …; *N*<sub>G</sub>}) and the other *N*<sub>S</sub> components corresponding to observation stations (*i* ∈ *S* = {*N*<sub>G</sub> + 1; …; *N*<sub>G</sub> + *N*<sub>S</sub>}); *zi* represents the variable of interest at location *xi*. Because the average situation for the season or month considered is known from past observations, we can subtract it and assume that **z** has mean **0**. Two iterative algorithms are proposed, depending on the set of points that drives the changes (grid nodes or observation stations). In both cases, an influence function ρ(*h*) is used for extending a change made at location *x* to location *x* + *h* depending on the separation *h*. This function satisfies ρ(0) = 1 and decreases to 0 when *h* increases. When extending to *xj* a change made at *xi*, the notation ρ*ij* = ρ(*xj* – *xi*) will be used.

*Algorithm driven by grid nodes:* As step *p* = 0, select a vector **z**<sup>0</sup> with components *z*<sub>*i*</sub><sup>0</sup>, for example zeroes or the values of the weather forecast for time *t* based on the objective analysis made at time *t* – 1. Then iterate as follows:


3. define model *p* as

$$z\_i^{p} = z\_i^{p-1} + \sum\_{j \in S} \rho\_{ij} \left(z\_j - z\_j^{p-1}\right), \qquad i \in G \cup S$$

*Algorithm driven by the observations:* As initial state, select a vector **z**<sup>new</sup> with components *z*<sub>*i*</sub><sup>new</sup>, for example zeroes or the values of the weather forecast for time *t* based on the objective analysis made at time *t* – 1. Then iterate as follows:


The convergence of both algorithms is ensured if and only if the matrix **ρ** defined by the ρ*ij* is positive definite, which is ensured if ρ(*h*) is a correlogram. Moreover, in that case, the iterative process converges to the solution of dual kriging. Indeed, both approaches amount to an iterative resolution of the dual kriging system (by the Jacobi method in the first approach, by the Gauss-Seidel method in the second one), followed, after each iteration, by the propagation of the changes to the point kriging estimates.
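The interpretation as an iterative solution of the dual kriging system can be illustrated as follows. This is a generic Gauss-Seidel sketch, not a transcription of Royer's steps: it assumes a 1D setting, an exponential correlogram ρ(*h*) = exp(−|*h*|/*a*) and a zero mean, and all names and values are hypothetical. The dual weights **b** solve **ρ** **b** = **z** at the stations, and the estimate at any point *x* is ∑ *bj* ρ(*x* − *xj*).

```python
import math


def rho(h, a=1.0):
    """Assumed correlogram: exponential model with scale a."""
    return math.exp(-abs(h) / a)


def dual_kriging_gauss_seidel(x_obs, z_obs, n_sweeps=200):
    """Solve rho * b = z_obs by Gauss-Seidel sweeps over the stations."""
    n = len(x_obs)
    R = [[rho(x_obs[i] - x_obs[j]) for j in range(n)] for i in range(n)]
    b = [0.0] * n
    for _ in range(n_sweeps):
        for i in range(n):
            # Use the most recent values of the other weights (Gauss-Seidel).
            s = sum(R[i][j] * b[j] for j in range(n) if j != i)
            b[i] = (z_obs[i] - s) / R[i][i]
    return b


def estimate(x, x_obs, b):
    """Dual kriging estimate at location x (zero mean assumed)."""
    return sum(bj * rho(x - xj) for xj, bj in zip(x_obs, b))


x_obs = [0.0, 1.0, 3.0]
z_obs = [0.5, -0.2, 1.1]
b = dual_kriging_gauss_seidel(x_obs, z_obs)
z_star = estimate(2.0, x_obs, b)
```

Because the correlogram matrix is positive definite, the sweeps converge, and the estimates at the stations reproduce the observations. No matrix factorization is needed, which is what makes such schemes usable with very large data sets.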

The second algorithm is very similar to the Gibbs propagation algorithm proposed nearly 40 years later by Lantuéjoul and Desassis (2012) to simulate a Gaussian vector (this algorithm is also presented in Chilès and Delfiner 2012, Sect. 7.6.3; it constitutes a further step to an algorithm proposed by Galli and Gao 1999). It is this similarity that reminded one of the present authors of Royer's paper, which to our knowledge has not been exploited by geostatisticians and deserves new consideration. These iterative algorithms have the advantage that they can be used even with a very large number of data, notably when the Cholesky method cannot be used.

### **29.8 Conclusion**

We have shown the long way from Krige's regression, which took account of two average sample grades (a local one and a global one) to avoid bias in the estimation of a panel, to present applications of kriging, which can deal with few data (e.g., a limited number of computer experiments in applications to DACE) as well as several hundred thousand data (remote sensing, seismic). We have seen the large diversity of application domains of kriging, so that it is probable that many users do not know the origin of the word: this is the price of success.

We also looked at current research aimed at enabling a global application of kriging to large data sets, with the requirement to also benefit from nonstationary random function models. Much work remains necessary to transform these approaches into standard methods applicable to a large variety of situations but, in view of the large community of researchers and developers in this area, there is no doubt that it will be done. The future will show which approaches are the most efficient ones.

### **References**


Yan JW, Liew KM, He LH (2012) A mesh-free computational framework for predicting buckling behaviors of single-walled carbon nanocones under axial compression based on the moving Kriging interpolation. Comput Methods Appl Mech Eng 247–248:103–112

Whittle P (1954) On stationary processes in the plane. Biometrika 41(3–4):434–449


### **Chapter 30 Multiple Point Statistics: A Review**

**Pejman Tahmasebi**

**Abstract** Geostatistical modeling is one of the most important tools for building an ensemble of probable realizations in earth science. Among geostatistical methods, multiple-point statistics (MPS) has recently made remarkable progress in handling complex and more realistic phenomena and in reproducing much of the expected uncertainty and variability. This progress is mostly due to the recent increase in computational power and to more advanced computational techniques. In this review chapter, the recent important developments in MPS are reviewed and the advantages and disadvantages of each method are discussed. Finally, the chapter provides a brief review of the current challenges and of paths that might be considered for future research.

### **30.1 Introduction**

Characterization and modeling of geological structures have been investigated for many years in the geosciences. Geostatistics is one of the methods that can be used to analyze the data effectively, both spatially and temporally. Lack of data is an intrinsic issue in earth science applications and causes significant uncertainty and ambiguity. Kriging, one of the most widespread geostatistical tools, was developed to deal with such problems. The basic mathematical equations of Kriging, first developed by Daniel Krige, were further advanced by Matheron (Journel and Huijbregts 1978; Matheron 1973). Kriging is a deterministic method, meaning that it produces only one outcome from the available sparse data, and therefore cannot by itself be used to quantify the uncertainty effectively. This method requires a prior model of variability and correlation between the variables, known as the variogram (Chiles and Delfiner 2011; Cressie and Wikle 2011; Deutsch and Journel 1998; Goovaerts 1997; Kitanidis 1997).

P. Tahmasebi (✉)

Department of Petroleum Engineering, University of Wyoming, Laramie, WY 82071, USA e-mail: ptahmase@uwyo.edu

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_30

It has been shown that Kriging produces excessively smooth results (Deutsch and Journel 1998; Journel and Zhang 2006) and cannot represent heterogeneity and non-smooth phenomena. One consequence is the overestimation of low values and the underestimation of high values. This problem becomes evident when important quantities such as water breakthrough are to be predicted. Thus, the results of Kriging cannot be used in these situations, as they ignore connectivity and variability.

Stochastic simulation can be used to overcome the limitations of Kriging (Goovaerts 1997; Journel and Huijbregts 1978). Several simulation methods have been proposed that can produce various equi-probable realizations. Methods such as sequential Gaussian simulation (SGSIM) and sequential indicator simulation (SISIM) have become popular in different fields of earth sciences. These methods give a number of "realizations" or interpolation scenarios, which allow the uncertainty to be assessed and quantified more accurately. It should be noted that Kriging is still the main algorithm used in the above stochastic methods. An example of the application of Kriging and stochastic modeling is provided in Fig. 30.1.

Because they rely on the variogram (i.e. covariance), kriging-based geostatistical simulations are not able to reproduce complex patterns. Clearly, considering only two points at a time is not sufficient for reproducing complex and heterogeneous models. Thus, several attempts have been made in recent years, in the context of multiple point geostatistics (MPS), to use more than two points simultaneously. Using the information from multiple points requires a large source of data, which is not usually available in earth science problems, as they come with sparse and incomplete data. Such information can instead be borrowed from a conceptual image, called a training image (**TI**).

**Fig. 30.1** Comparison between the results of Kriging (**b**) and stochastic simulation (**c**) using conditioning point data in (**a**)

Technically, geostatistical methods can be divided into three main groups. Object-based (or Boolean) simulation methods are in the first group (Kleingeld et al. 1997). These methods consider the medium as a group of stochastic objects that are defined based on a specific statistical distribution (Deutsch and Wang 1996; Haldorsen and Damsleth 1990; Holden et al. 1998; Skorstad et al. 1999).

Pixel-based methods are considered in the second group. These methods are based on a set of points/pixels that represent various properties of a phenomenon. Mathematically speaking, such methods range from the LU decomposition of the covariance matrix (Davis 1987), sequential Gaussian simulation (Dimitrakopoulos and Luo 2004) and frequency-domain simulation (Borgman et al. 1984; Chu and Journel 1994) to simulated annealing (Hamzehpour and Sahimi 2006) and the genetic algorithm. The last two methods, which are optimization techniques, also belong to this group as they gradually change an earth model in a pixel-by-pixel manner.

Each of the above methods has some advantages and limitations. For example, geological structures can be reproduced accurately using object-based simulations. However, conditioning these methods to well and soft data requires intensive computation.

Pixel-based methods simulate one pixel at a time. Such techniques reproduce the conditioning point data exactly. One drawback of these methods is that they are based on variograms, which represent two-point statistics, and thus they cannot reproduce complex and realistic geological structures. Consequently, the models generated using these techniques cannot provide an accurate input for physics-based simulations (e.g. flow, grade distribution, contaminant forecasting, etc.).

In the MPS methods, the spatial statistics are not extracted using a variogram. Instead, a conceptual tool named training image (TI), which is an example of the spatial structure to be reproduced, provides the necessary information. In recent years, several MPS methods have been developed to address issues related to CPU time and to improve the graphical representation of the models produced. This chapter, thus, reviews the existing concepts in MPS and discusses the available methods. The main two-point based stochastic simulation methods are first reviewed. Then, the basic terminology and concepts of MPS are presented. Next, different MPS methods are explained and the advantages and disadvantages associated with each method are discussed. Finally, some avenues for future research are discussed.

### **30.2 Two-Point Based Stochastic Simulation**

The smoothing effect of Kriging can be avoided by using sequential simulation, which helps to quantify the uncertainty more accurately. Consider a set of *N* random variables *Z*(**u**<sub>α</sub>), α = 1, …, *N*, defined at locations **u**<sub>α</sub>. The aim of sequential simulation is to produce realizations {*z*(**u**<sub>α</sub>), α = 1, …, *N*}, conditioned to *n* available data and reproducing a given multivariate distribution. To this end, the multivariate distribution is decomposed into a set of *N* univariate conditional cumulative distribution functions (*ccdfs*):

$$\begin{aligned} F(\mathbf{u}\_1, \dots, \mathbf{u}\_N; z\_1, \dots, z\_N | (n)) = {} & F(\mathbf{u}\_1; z\_1 | (n)) \times F(\mathbf{u}\_2; z\_2 | (n+1)) \times \dots \\ & \times F(\mathbf{u}\_{N-1}; z\_{N-1} | (n+N-2)) \times F(\mathbf{u}\_N; z\_N | (n+N-1)) \end{aligned} \tag{30.1}$$

where *F*(**u**<sub>*N*</sub>; *z*<sub>*N*</sub> | (*n* + *N* − 1)) = Prob{*Z*(**u**<sub>*N*</sub>) ≤ *z*<sub>*N*</sub> | (*n* + *N* − 1)} is the ccdf of *Z*(**u**<sub>*N*</sub>) conditioned to the set of *n* original data and the (*N* − 1) previously simulated values.

### *30.2.1 Sequential Gaussian Simulation (SGSIM)*

In this method, the multivariate distribution, including its higher-order statistics, is constructed from lower-order statistics such as the histogram and the variogram. In other words, the mean and covariance matrix are used to build a Gaussian distribution. Therefore, along a random path, the mean and variance of the Gaussian distribution at each node are estimated via the Kriging estimate and the Kriging variance. The overall algorithm of SGSIM can be summarized as follows. First, a random path is defined over all visiting points on the simulation grid. Then, at the current node, the *ccdf* is estimated by Kriging from the hard data and the previously simulated values. Then, a random value is drawn from the obtained Gaussian *ccdf* and added to the simulation grid. Next, following the predefined random path, another node is chosen and simulated, until all nodes have been visited. Finally, another realization can be generated using a different random path.

It is worth noting that the conditioning data should be normally distributed. If this is not the case, they must be transformed to a Gaussian distribution in order to be usable in SGSIM, and the results must be back-transformed at the end of the simulation. Such transformations can be accomplished using normal-score transforms or histogram anamorphosis through Hermite polynomials.
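The steps of SGSIM described above can be sketched on a 1D grid. This is a minimal illustration under stated assumptions, not a production implementation: the data are assumed to be already normal scores (zero mean, unit variance), the covariance is an assumed exponential model, simple kriging uses all previously known values (no search neighborhood), and the small linear solver and all names are hypothetical.

```python
import math
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def sgsim_1d(n_nodes, hard_data, a=3.0, seed=0):
    """One conditional realization on nodes 0..n_nodes-1 (unit spacing)."""
    rng = random.Random(seed)
    cov = lambda h: math.exp(-abs(h) / a)   # assumed covariance model
    known = dict(hard_data)                 # node index -> value
    path = [u for u in range(n_nodes) if u not in known]
    rng.shuffle(path)                       # random visiting path
    for u in path:
        locs = list(known)
        mean, var = 0.0, 1.0                # prior moments if nothing is known
        if locs:
            C = [[cov(p - q) for q in locs] for p in locs]
            c0 = [cov(u - p) for p in locs]
            lam = solve(C, c0)              # simple kriging weights
            mean = sum(l * known[p] for l, p in zip(lam, locs))
            var = max(1.0 - sum(l * c for l, c in zip(lam, c0)), 0.0)
        known[u] = rng.gauss(mean, math.sqrt(var))  # draw from the Gaussian ccdf
    return [known[u] for u in range(n_nodes)]


hard_data = {3: 1.2, 10: -0.7}
real_1 = sgsim_1d(20, hard_data, seed=1)
real_2 = sgsim_1d(20, hard_data, seed=2)
```

Each seed defines a different random path and different draws, so the two realizations differ everywhere except at the conditioning nodes, which are honored exactly.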

### *30.2.2 Sequential Indicator Simulation (SISIM)*

Indicator simulation follows the same principle as SGSIM. This method, however, is suited for categorical data, which do not have an order relationship. Typical examples in earth science are rock type, lithology codes and some other categorical properties. A similar sequential procedure, based on the estimation of the *ccdf* conditional to neighboring data, is applied here as well. This algorithm is based on two-point indicator variograms, which represent the spatial variability of each category. An indicator variable is defined for each category, equal to 1 if that category is found at location **u**, and zero otherwise. Also, *E*{*I*(**u**)} = *p* is the stationary proportion of a given category. The two-point statistic underlying the indicator variogram, the non-centered indicator covariance, can be expressed as:

$$\begin{aligned} \operatorname{Prob}\{I(\mathbf{u})=1, I(\mathbf{u}+h)=1\} &= E\{I(\mathbf{u})I(\mathbf{u}+h)\} \\ &= \operatorname{Prob}\{I(\mathbf{u})=1 | I(\mathbf{u}+h)=1\} \, p \end{aligned} \tag{30.2}$$

Usually, a categorical variable takes one of *K* discrete categories, that is, *z*(**u**) ∈ {0, …, *K* − 1}. Therefore, the indicator value for each of the defined classes can be expressed as:

$$I(\mathbf{u},k) = \begin{cases} 1 & Z(\mathbf{u}) = k \\ 0 & \text{otherwise} \end{cases} \tag{30.3}$$
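As a small illustration of the indicator coding and of the two-point probability in Eq. (30.2), the sketch below works on a hypothetical 1D sequence of facies codes with unit spacing. The function names and the data are assumptions made for the example only.

```python
def indicator(values, k):
    """Indicator transform I(u, k): 1 where the category equals k, else 0."""
    return [1 if v == k else 0 for v in values]


def proportion(values, k):
    """Experimental estimate of the marginal probability E{I(u, k)}."""
    ind = indicator(values, k)
    return sum(ind) / len(ind)


def joint_probability(values, k, lag):
    """Experimental Prob{I(u)=1, I(u+h)=1} for category k at an integer lag."""
    ind = indicator(values, k)
    pairs = [(ind[i], ind[i + lag]) for i in range(len(ind) - lag)]
    return sum(a * b for a, b in pairs) / len(pairs)


facies = [0, 0, 1, 1, 0, 2, 2, 0]   # hypothetical facies codes along a line
categories = sorted(set(facies))
p0 = proportion(facies, 0)
j0 = joint_probability(facies, 0, lag=1)
```

At every location exactly one indicator equals 1, so the indicators (and the proportions) of all categories sum to one.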

The aim of the indicator formulation is to estimate the probability of *Z*(**u**) to be less than the predefined threshold for a category conditional to the data (*n*) retained:

$$\begin{split} I^\*(\mathbf{u}, z\_k) &= E^\*(I(\mathbf{u}, z\_k) | (n)) \\ &= \operatorname{Prob}^\*(Z(\mathbf{u}) < z\_k | (n)) \end{split} \tag{30.4}$$

We can rewrite the above equation for categorical variables by using simple Kriging as:

$$\begin{aligned} I^\*(\mathbf{u}, z\_k) - E(I(\mathbf{u}, k)) &= \sum\_{a=1}^n \lambda\_a(\mathbf{u}) (I(\mathbf{u}\_a, k) - E\{I(\mathbf{u}\_a, k)\}) \\ \sum\_{k=1}^K I^\*(\mathbf{u}, k) &= 1 \end{aligned} \tag{30.5}$$

where *E*{*I*(**u**, *k*)} is the marginal probability for category *k*.

The above formulation can be applied within the sequential scheme, which is known as SISIM. Indicator Kriging (IK) is used to estimate the probability of each category. This algorithm can be described as follows. As in SGSIM, a random path is defined by which all of the nodes are visited. Then, using Simple Kriging, the indicator random variable for each category is estimated for each node on the random path based on the neighboring data. Next, the conditional probability density function (*cpdf*) is obtained and a value is randomly drawn from that *cpdf* and assigned to the simulated node. This procedure is repeated sequentially for all the visiting nodes until the simulation grid is completed. By choosing another random path, one can generate another realization. More information on this method can be found in Goovaerts (1997).

### **30.3 Multiple Point Geostatistics (MPS)**

One of the bottlenecks of two-point based geostatistical simulations is their inability to deal with complex and heterogeneous spatial structures. Such methods cannot fully reproduce the existing physics, and most of their parameters usually do not have an equivalent in reality. In particular, these methods cannot convey connectivity and variability when the considered phenomenon contains definite patterns or structures. For example, models containing regular structures cannot be reproduced using the SGSIM method. Thus, increasing the number of points considered simultaneously can help in reproducing connectivity and complex features. The MPS methods, indeed, intend to reproduce the physics of natural phenomena, and they are all based on a set of training images. Below, some preliminary concepts are first reviewed.

### *30.3.1 Training Image*

The training image (TI) is one of the most important inputs in the MPS techniques. Thus, providing a representative TI, or a set of TIs, is the biggest challenge in MPS applications. In general, TIs can be generated using the physics derived from process-based methods, using statistical methods, or using the rules extracted and observed for each geological system. The TI can be of any type, ranging from an image to statistical properties in space and time. In fact, TIs allow subjective geological knowledge to be included in the modeling, which is difficult to take into account in traditional statistical methods. In a broader sense, a TI can be constructed with traditional statistical methods. These outcomes, however, do not represent the deterministic aspects of geological models, as they usually tend to emphasize the random component. Geologically speaking, most of the images in natural sciences exhibit some degree of complexity and uniqueness. Some examples of the available **TI**s are shown in Fig. 30.2.

The available methods for constructing the TIs are divided into three main groups:


**Fig. 30.2 a** Wagon Rock Caves outcrop (Anderson et al. 1999), **b** digitized outcrop derived from (**a**), **c** Herten gravel pit (Bayer et al. 2011), **d** litho and hydrofacies distribution extracted from (**c**), **e** a 3D object-based model (Tahmasebi and Sahimi 2016a), **f** some 2D sections of the 3D model shown in (**e**), **g** a 2D model generated using the process-based techniques (Tahmasebi and Sahimi 2016a), **h** a 3D model generated by the process-based methods (Tahmasebi and Sahimi 2016a)

algorithm to provide any further alterations. The results of object-based simulation methods are one of the best and most accessible sources for TIs.

• *Process*-*based Methods*: Process-based methods (Biswal et al. 2007, 1999; Bryant and Blunt 1992; Gross and Small 1998; Lancaster and Bras 2002; Pyrcz et al. 2009; Seminara 2006) try to develop 3D models by mimicking the physical processes that form the porous medium. Though realistic, such methods are, however, computationally expensive and require considerable calibrations. Moreover, they are not general enough, because each of them is developed for a specific type of formation, as each type is the outcome of some specific physical processes.

### **30.4 Simulation Path**

Geostatistical techniques are conducted on a simulation grid **G**, which is composed of a number of cells. These cells are visited along a predefined path, either in a random or in a structured manner (i.e. raster path).

### *30.4.1 Random Path*

The random path is one of the most commonly used visiting paths in sequential simulation algorithms. For this path, a series of random numbers, equal in number to the unknown cells and based on a random seed, is generated for each realization, and the unvisited points on **G** are simulated accordingly. Clearly, the number of simulated (i.e. known) points increases as the simulation proceeds. Each realization is generated using its own simulation path. These paths are commonly unbiased around the conditioning point data.

### *30.4.2 Raster Path*

Algorithms based on a raster path are popular in stochastic modeling. These paths are constructed as a structured 1D path, meaning that the simulation cells are visited systematically and one can predict the future visiting points. Daly (2005) presented a Monte Carlo algorithm that utilized a raster path. Then, a patch-based algorithm based on this path was used by El Ouassini et al. (2008). Next, Parra and Ortiz (2011) used a similar path in their study. Finally, Tahmasebi et al. (2014, 2012a, b) implemented a raster path along with a fast similarity computation and achieved high-quality realizations. Such paths usually produce high-quality realizations that can barely be produced using random path algorithms.

One of the advantages of using such paths is the small number of constraints, which helps the algorithms to better identify the matching data (or patterns). For instance, one only deals with 1–2 overlap regions in 2D simulations, which is much more efficient than the four overlaps used in random path algorithms. One should expect more discontinuities and artefacts when the number of overlaps is increased. Indeed, identifying a pattern from the TI based on four constraints is very difficult, if not impossible. Therefore, using a small number of overlaps is desirable, as it results in high-quality realizations. Raster path algorithms offer such a prospect, and one can achieve realizations of higher quality.

Dealing with conditioning data (e.g. point and secondary data) is one of the crucial issues in these paths. They, in fact, cannot account for the conditioning data that are ahead of them. Therefore, some biases have been observed in these algorithms, particularly around the conditioning point data. Some complementary methods such as template splitting (Tahmasebi et al. 2012a) and co-template (Parra and Ortiz 2011; Tahmasebi et al. 2014) have addressed this issue partially.
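The two kinds of visiting paths can be sketched as follows for a 2D grid. The function names are hypothetical, and the raster order chosen here (row by row, left to right) is only one possible convention.

```python
import random


def raster_path(n_rows, n_cols):
    """Systematic path: cells visited row by row, left to right."""
    return [(i, j) for i in range(n_rows) for j in range(n_cols)]


def random_path(n_rows, n_cols, seed, known=()):
    """Random permutation of the cells that do not hold conditioning data."""
    known = set(known)
    cells = [c for c in raster_path(n_rows, n_cols) if c not in known]
    random.Random(seed).shuffle(cells)   # one seed per realization
    return cells


path_a = random_path(3, 4, seed=0, known=[(1, 2)])
path_b = random_path(3, 4, seed=1, known=[(1, 2)])
path_r = raster_path(3, 4)
```

The raster path is identical for every realization and fully predictable, whereas changing the seed changes the random path. In both cases each unknown cell is visited exactly once.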

### *30.4.3 Some Other Definitions*

*Simulation Grid* (**G**): a 2D/3D computation grid on which the geostatistical modeling is performed and is composed of several cells, depending on the size of domain and simulation. It contains no information for unconditional simulation, while the hard data are distributed in their corresponding cells.

*Data*-*Event*: a set of points that are characterized by a distance, namely lag, which are considered around a visiting point (cell) on G.

*Template*: a set of points that are organized systematically and used for finding similar patterns in TI.

### **30.5 Current Multiple Point Geostatistical Algorithms**

Generally, the MPS methods have been developed in both pixel- and pattern-based forms, each with pros and cons similar to those discussed above. For example, the pixel-based MPS methods can perfectly match the well data, whereas, in some complex geological models, they produce unrealistic structures. On the other hand, pattern-based techniques, which simulate a group of points at a time, give a more accurate representation of the subsurface model, while they usually miss the conditioning data. Currently, these techniques are under active development, with the aim of simultaneously reproducing the conditioning data and geologically realistic structures. As mentioned, conditioning to well data is one of the critical issues in the pattern-based techniques. Thus, taking advantage of the capabilities of both pixel- and pattern-based techniques through hybrid frameworks can result in an efficient algorithm. Such a combination is reviewed in this chapter as well.

Most of the available MPS methods can be used with non-stationary systems, that is, systems in which the statistical properties of one region are different from those of other parts (Chugunova and Hu 2008; Honarkhah and Caers 2012; Mariethoz et al. 2010; Strebelle 2012; Tahmasebi and Sahimi 2015a; Wu et al. 2008).

### *30.5.1 Pixel-Based Algorithms*

### i. *Extended Normal Equation Simulation (ENESIM)*

ENESIM is the first method in which the idea of MPS was raised (Guardiano and Srivastava 1993). This method is based on an extended concept of indicator kriging, which allows reproduction of multiple-point events inferred from a TI. It first finds the data event at each visiting point and then scans the TI to identify all of its occurrences. Then, a conditional distribution is constructed from all the identified occurrences. Next, a sample is drawn from the generated histogram and placed at the visiting point on **G**. One of the main drawbacks of this algorithm is that the TI is scanned for each visiting point, which makes it impractical for large **G** and TI. This algorithm was later redesigned as the SNESIM algorithm with the aid of a search tree, so that the TI does not need to be rescanned for each visiting point but is scanned only once before the simulation begins. Some of the results of this algorithm are presented in Fig. 30.3.
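The scan-and-draw step described above can be illustrated with a minimal sketch. It is not the original implementation: the function names, the offset-dictionary encoding of the data event, the fallback to the TI proportions when no replicate is found, and the toy striped TI are all assumptions made for illustration.

```python
import numpy as np

def scan_training_image(ti, data_event, n_categories):
    """Count central-node values over all TI replicates of a data event.

    data_event maps (row, col) offsets from the visited node to known values.
    """
    counts = np.zeros(n_categories, dtype=int)
    rows, cols = ti.shape
    for y in range(rows):
        for x in range(cols):
            for (dy, dx), value in data_event.items():
                yy, xx = y + dy, x + dx
                inside = 0 <= yy < rows and 0 <= xx < cols
                if not inside or ti[yy, xx] != value:
                    break  # this location is not a replicate
            else:
                counts[ti[y, x]] += 1  # every informed node matched
    return counts

def draw_value(ti, data_event, n_categories, rng):
    """Draw one category from the conditional histogram built by the scan."""
    counts = scan_training_image(ti, data_event, n_categories)
    if counts.sum() == 0:
        # Assumption: with no replicate, fall back to the TI proportions.
        counts = np.bincount(ti.ravel(), minlength=n_categories)
    return int(rng.choice(n_categories, p=counts / counts.sum()))

# Toy TI made of vertical stripes of width two (hypothetical example).
ti = np.tile(np.array([0, 0, 1, 1]), (6, 2))
event = {(0, -2): 0, (0, -1): 0}  # the two nodes to the left are facies 0
counts = scan_training_image(ti, event, n_categories=2)
value = draw_value(ti, event, 2, np.random.default_rng(0))
```

Because the full TI is rescanned for every call, the cost grows with the product of the grid and TI sizes, which is the drawback noted in the text.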

### ii. *Simulated Annealing*

Simulated annealing (SA) is a popular optimization method that is used to find the global minimum. Suppose *E* represents the energy:

$$E = \sum_{j=1}^{n} \left[ f\left(\mathbf{O}_{j}\right) - f\left(\mathbf{S}_{j}\right) \right]^{2} \tag{30.6}$$

where **O**<sub>*j*</sub> and **S**<sub>*j*</sub> represent the observed (or measured) and the corresponding simulated (calculated) properties of a porous medium, respectively, with *n* being the number of data points. If there is more than one set of data for distinct properties of the medium, the energy *E* is generalized to

$$E = \sum_{i=1}^{m} \omega_i E_i \tag{30.7}$$

where *E<sub>i</sub>* is the total energy for data set *i* and *ω<sub>i</sub>* the corresponding weight, as two distinct sets of data for the same porous medium do not usually have the same weight or significance.

**Fig. 30.3** The results of the ENESIM algorithm. **a** cross-bedded sand, which is used as TI, **b** one realization generated using ENESIM, **c** TI: fractured model generated using physical rock propagation, **d** one conditional realization based on the TI in (**c**) and 200 conditioning data points. The results are borrowed from Guardiano and Srivastava (1993)

An initial guess for the structure of the medium is usually taken as the starting point of the algorithm. Then, a small perturbation is made to the current model, and the new energy *E*′ and the difference Δ*E* = *E*′ − *E* are computed. The perturbation is then accepted with a probability *p*(Δ*E*) which, according to the Metropolis algorithm, is

$$p(\Delta E) = \begin{cases} 1, & \Delta E \le 0, \\ \exp\left(-\Delta E/T\right) & \Delta E > 0, \end{cases} \tag{30.8}$$

where *T* is a fictitious temperature.

Based on statistical mechanics, it is well known that the ground state of a system can be reached when it is heated to a high temperature *T* and then slowly cooled down to absolute zero. The cooling must be slow enough to allow the system to reach its true equilibrium state, which prevents it from becoming trapped in a local energy minimum. At each annealing step *i* the system is allowed to evolve long enough to "*thermalize*" at *T*(*i*). Then, the temperature *T* is decreased, and the process continues until the true ground state of the system is reached, i.e. until *E* is deemed to be small enough. This method has been used widely for reconstruction of fine-scale porous media by Yeong and Torquato (1998a, b), Manwart et al. (2000) and Sheehan and Torquato (2001), as well as large-scale oil reservoirs.

Using this framework, the algorithm starts from a spatially random distribution on the simulation grid **G**, in which the hard data are placed at the same time. Afterwards, the simulation cells are visited and the energy of the realization (i.e. global energy) is calculated using the terms considered in the objective function. Then, the probability of acceptance/rejection is calculated and the new value is kept or discarded accordingly. Consequently, the objective function is updated and the next *T* is defined. This process continues until the predefined stopping criteria are met.
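The loop described above can be sketched as follows. This is only an illustration under stated assumptions: the statistic *f* (fraction of equal neighbours along each axis), the single-cell flip used as perturbation, the geometric cooling schedule and all parameter values are hypothetical choices, not those of the cited works. Only the acceptance rule follows Eq. (30.8) and the energy follows Eq. (30.6).

```python
import numpy as np

def metropolis_accept(delta_e, temperature, rng):
    """Eq. (30.8): accept downhill moves, uphill ones with exp(-dE/T)."""
    if delta_e <= 0:
        return True
    return rng.random() < np.exp(-delta_e / temperature)

def lag_stats(grid):
    """Hypothetical statistic f: fraction of equal neighbours per axis."""
    return np.array([
        np.mean(grid[:, 1:] == grid[:, :-1]),
        np.mean(grid[1:, :] == grid[:-1, :]),
    ])

def energy(grid, target):
    """Eq. (30.6) with the lag statistics as observed/simulated properties."""
    return float(np.sum((target - lag_stats(grid)) ** 2))

def anneal(target, shape, hard_data, n_steps=5000, t0=0.01,
           cooling=0.999, seed=0):
    rng = np.random.default_rng(seed)
    grid = rng.integers(0, 2, size=shape)      # random initial guess
    frozen = np.zeros(shape, dtype=bool)
    for (y, x), v in hard_data.items():        # place the hard data in G
        grid[y, x] = v
        frozen[y, x] = True
    free = np.argwhere(~frozen)                # cells allowed to change
    e = e_start = energy(grid, target)
    t = t0
    for _ in range(n_steps):
        y, x = free[rng.integers(len(free))]
        grid[y, x] = 1 - grid[y, x]            # small perturbation
        e_new = energy(grid, target)
        if metropolis_accept(e_new - e, t, rng):
            e = e_new                          # keep the new value
        else:
            grid[y, x] = 1 - grid[y, x]        # discard it
        t *= cooling                           # assumed geometric cooling
    return grid, e_start, e

target = np.array([0.9, 0.6])                  # hypothetical target statistics
hard = {(0, 0): 1, (5, 5): 0}
grid, e_start, e_end = anneal(target, (20, 20), hard)
```

Recomputing the global energy after every perturbation, as done here for clarity, is the main reason why adding more statistics to the objective function quickly increases CPU time.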

Deutsch (1992) used this algorithm to reproduce some MPS properties. He considered an objective function that satisfies some constraints such as the histogram and variogram. Furthermore, some researchers applied simulated annealing to the simulation of continuous variables (Fang and Wang 1997). However, simulated annealing has drawbacks, a major one being CPU time. Therefore, one can only consider a limited number of statistics as constraints, because increasing the number of constraints has a strong effect on CPU time. In addition, this algorithm has many parameters that must be tuned and therefore needs a large amount of trial and error to achieve optimal values. Peredo and Ortiz (2011) used speculative parallel computing to accelerate simulated annealing; however, the computation times are still far from those obtained with sequential simulation methods (Deutsch and Wen 2000). An overall comparison between the SA algorithm and the traditional algorithms is presented in Fig. 30.4. In a similar fashion, multiple-point statistical methods have also used iterations to remove artifacts (Pourfard et al. 2017; Tahmasebi and Sahimi 2016a). It should be noted that the sequential algorithms can be parallelized using different strategies, which are not discussed in this review chapter (Rasera et al. 2015; Tahmasebi et al. 2012b).

### iii. *Markov Random Field (MRF)*

These models incorporate constraints by formulating high-order spatial statistics and enforcing them on the simulated domain using a Metropolis-Hastings algorithm. In this case, the computational problem of the previous methods remains because the Metropolis-Hastings algorithm, although always converging in theory, may not converge in a reasonable time. The model parameters are inferred from the available data, namely TI.

The Markovian properties are usually expressed as a conditional probability:

$$p(\mathbf{Z}|\text{all previous Z}) = p(z_1)\,p(z_2|z_1)\dots \underbrace{p(z_N|z_{N-1}, z_{N-2}, \dots, z_2, z_1)}_{p(z_N|\mathbf{Z}_{\Phi_N})}\tag{30.9}$$


**Fig. 30.4** A comparison between the results of the SA algorithm and traditional two-point based geostatistical simulations (Deutsch 1992). It should be noted that the results of the SGSIM, SISIM and SA algorithms are generated based on the TI shown in Fig. 30.3a. The last row is borrowed from Peredo and Ortiz (2011)

where **Z**<sub>Φ<sub>*N*</sub></sub> denotes the set of values on which *z<sub>N</sub>* is conditioned and *p*(*z<sub>N</sub>*) > 0 ∀ *z<sub>N</sub>*.

**Fig. 30.5** An illustration of the MMM method. The gray cells represent the unvisited points. The neighborhood is shown in a red polygon. This figure is taken from Stien and Kolbjørnsen (2011)

Fully utilizing the MRF algorithm for large 3D simulation grids in earth science is not practical. Thus, researchers have focused on less computationally demanding algorithms such as Markov Mesh Models (MMM) (Daly 2005; Stien and Kolbjørnsen 2011; Toftaker and Tjelmeland 2013). In this algorithm, the conditioning is restricted to a reasonably small window around the visiting point (see Fig. 30.5). Thus, Eq. (30.9) can be shortened to:

$$p(\mathbf{Z}) = p(z_1)\,p(z_2|z_1)\dots \underbrace{p(z_n|z_{n-1}, z_{n-2}, \dots, z_2, z_1)}_{p(z_n|\mathbf{Z}_{\Phi_n})}\tag{30.10}$$

where *n* ≪ *N*.

Tjelmeland and Eidsvik (2005) used a sampling algorithm that incorporates an auxiliary random variable. These methods suffer from extensive CPU demand and unstable convergence. In addition, large structures cannot be reproduced well. Together, these factors make the methods difficult to use for 3D applications. Some of the results of this method are shown in Fig. 30.6.

### iv. *Single Normal Equation Simulation (SNESIM)*

The single normal equation simulation (SNESIM) is an improved version of the original algorithm proposed by Guardiano and Srivastava (1993). The SNESIM algorithm scans the input TI only once and stores the frequency/probability of all pattern occurrences in a search tree (Boucher 2009; Strebelle 2002), which reduces the computational time significantly. Then, the probabilities are retrieved from the constructed search tree based on the existing data in the data-event. The SNESIM algorithm is a pixel-based algorithm, which can perfectly reproduce the conditioning point data.

**Fig. 30.6** A demonstration of the results of MRF (Daly 2005; Stien and Kolbjørnsen 2011). **a**, **c** TI and **b**, **d** realizations

The SNESIM algorithm is a sequential algorithm and, thus, each cell *S* can take one of *K* possible states {*s<sub>k</sub>*, *k* = 1, …, *K*}, which usually represent facies units. This algorithm, like any other conditional technique, calculates the joint probability over *n* discrete points using:

$$\Phi(\mathbf{h}_1, \dots, \mathbf{h}_n; k_1, \dots, k_n) = E\left\{ \prod_{\alpha=1}^n I(\mathbf{u} + \mathbf{h}_\alpha; k_\alpha) \right\} \tag{30.11}$$

where **h**, *k*, **u** and *E* represent the separation vector (lag), state value, visiting location and expected value, respectively, and *I*(**u**; *k*) denotes the indicator value at location **u**. This equation thus gives the probability of having the *n* values (*k*<sub>1</sub>, …, *k<sub>n</sub>*) at the locations *s*(**u** + **h**<sub>1</sub>), …, *s*(**u** + **h**<sub>*n*</sub>). In SNESIM, the above probability is replaced with the following equation:

$$\Phi(\mathbf{h}_1, \dots, \mathbf{h}_n; k_1, \dots, k_n) \cong \frac{c(d_n)}{N_n} \tag{30.12}$$

where *N<sub>n</sub>* and *c*(*d<sub>n</sub>*) denote the total number of patterns in the TI and the number of replicates of the data event *d<sub>n</sub>* = {*s*(**u** + **h**<sub>*α*</sub>) = *s*<sub>*k*<sub>*α*</sub></sub>, *α* = 1, …, *n*}, respectively.
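A small sketch may help to see how the ratio in Eq. (30.12) is obtained after a single pass over the TI. As a deliberate simplification, a flat dictionary of pattern counts stands in here for the search tree; the four-node template, the use of `None` for uninformed nodes and the toy TI are assumptions for illustration only.

```python
from collections import Counter
import numpy as np

# Hypothetical four-node template: left, right, upper and lower neighbours.
TEMPLATE = [(0, -1), (0, 1), (-1, 0), (1, 0)]

def build_pattern_counts(ti, template=TEMPLATE):
    """Single pass over the TI: count (neighbour values, central value)."""
    counts = Counter()
    m = max(max(abs(dy), abs(dx)) for dy, dx in template)
    rows, cols = ti.shape
    for y in range(m, rows - m):
        for x in range(m, cols - m):
            neigh = tuple(int(ti[y + dy, x + dx]) for dy, dx in template)
            counts[(neigh, int(ti[y, x]))] += 1
    return counts

def conditional_probabilities(counts, data_event, n_categories):
    """P(centre = k | data event); None marks an uninformed template node."""
    totals = np.zeros(n_categories)
    for (neigh, centre), c in counts.items():
        if all(d is None or d == v for d, v in zip(data_event, neigh)):
            totals[centre] += c            # replicates c(d_n) per category
    n = totals.sum()
    return totals / n if n > 0 else None   # None: no replicate in the TI

# Toy TI of vertical stripes; no rescanning is needed after this call.
ti = np.tile(np.array([0, 0, 1, 1]), (6, 2))
counts = build_pattern_counts(ti)
p_up = conditional_probabilities(counts, (None, None, 1, None), 2)
p_none = conditional_probabilities(counts, (None, None, None, None), 2)
```

In this toy case the upper neighbour fully determines the central value, so `p_up` puts all the probability on facies 1, while an empty data event returns the marginal proportions of the scanned nodes.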

**Fig. 30.7** Demonstration of multiple-grid approach in SNESIM. The figure is taken from Wu et al. (2008)

This algorithm benefits from a multiple-grid approach, by which the large structures are first captured using a smaller number of nodes and the details are then added. This concept is illustrated in Fig. 30.7.
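The multiple-grid idea can be sketched in a few lines. The sketch assumes the commonly used convention in which the node spacing and the template offsets of grid level *g* are multiplied by 2<sup>*g*−1</sup>; the function names are hypothetical and the source does not state which spacing is used.

```python
def multigrid_template(base_offsets, level):
    """Dilate template offsets by 2**(level - 1); level 1 is the finest grid."""
    factor = 2 ** (level - 1)
    return [(dy * factor, dx * factor) for dy, dx in base_offsets]

def multigrid_nodes(shape, level):
    """Nodes of the simulation grid that belong to a given grid level."""
    step = 2 ** (level - 1)
    return [(y, x)
            for y in range(0, shape[0], step)
            for x in range(0, shape[1], step)]

base = [(0, 1), (1, 0)]
# Simulation proceeds from the coarsest level down to the finest one.
schedule = [(g, len(multigrid_nodes((8, 8), g)), multigrid_template(base, g))
            for g in (3, 2, 1)]
```

On an 8 × 8 grid the coarsest of three levels holds only four nodes, which are simulated with widely spaced template nodes so that large structures are set before the details are filled in.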

One of the limitations of the SNESIM algorithm is its inability to produce realistic, highly connected and large-scale geological features. Moreover, this algorithm can be used only with categorical TIs, and it is still inefficient for real applications with multimillion cells (Tahmasebi et al. 2014). Several other methods were later proposed to improve the efficiency and quality of the SNESIM algorithm (Cordua et al. 2015; Straubhaar et al. 2013). A new technique has recently been presented that can take the realizations and perform a structural adjustment to match the well data (Tahmasebi 2017) (Fig. 30.8).

### v. *Direct Sampling*

The direct sampling (DS) method is very similar to the SIMPAT algorithm (see below), except that it scans only a part of the TI and pastes a single pixel (Mariethoz et al. 2010). Since the TI is scanned in each loop of the simulation, there is no need to build any database and less RAM is required. Like the pattern-based techniques, this algorithm uses a distance function for finding the closest patterns in the TI. This method can be used for both categorical and continuous variables.

**Fig. 30.8** The results of SNESIM. The realizations shown in (**a**, **b**) are generated using the TI in Fig. 30.6c. **c** TI and **d** a realization based on the TI in (**c**)

The DS algorithm selects the known data at each visiting point. Then, the similarity of the data event with the TI is calculated over a predefined portion of the TI. As soon as the first occurrence of a matching data event is found in the TI (i.e. one with a distance under a given acceptance threshold), the search stops, and the value of the central node of that data event in the TI is accepted and pasted into the simulation. If no such occurrence is found, the most similar pattern found in the predefined portion of the TI is selected and its central node is pasted on the simulation grid (Fig. 30.9).
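The simulation of a single node with DS, as described above, might look like the following sketch. The function name, the random scan order, the fraction-of-mismatches distance (suited to categorical variables only) and the fallback for an empty data event are assumptions made for illustration and are not taken from Mariethoz et al. (2010).

```python
import numpy as np

def ds_simulate_node(ti, data_event, threshold, max_fraction, rng):
    """Return a value for one node by scanning part of the TI.

    data_event maps (row, col) offsets to the values of informed neighbours.
    """
    rows, cols = ti.shape
    if not data_event:                       # nothing known: random TI value
        return ti[rng.integers(rows), rng.integers(cols)]
    n_scan = max(1, int(max_fraction * ti.size))
    best_value, best_dist = None, np.inf
    for flat in rng.permutation(ti.size)[:n_scan]:
        y, x = divmod(int(flat), cols)
        mismatches, valid = 0, True
        for (dy, dx), v in data_event.items():
            yy, xx = y + dy, x + dx
            if not (0 <= yy < rows and 0 <= xx < cols):
                valid = False                # data event leaves the TI
                break
            mismatches += int(ti[yy, xx] != v)
        if not valid:
            continue
        dist = mismatches / len(data_event)  # assumed categorical distance
        if dist < best_dist:
            best_value, best_dist = ti[y, x], dist
        if dist <= threshold:
            break                            # first acceptable match: stop
    if best_value is None:                   # no valid location was scanned
        best_value = ti[rng.integers(rows), rng.integers(cols)]
    return best_value

ti = np.tile(np.array([0, 0, 1, 1]), (6, 2))   # toy TI with vertical stripes
rng = np.random.default_rng(0)
value = ds_simulate_node(ti, {(-1, 0): 1}, threshold=0.0,
                         max_fraction=1.0, rng=rng)
```

No pattern database is built here; the only memory required is the TI itself, at the price of a new partial scan for every simulated node.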

### vi. *Cumulants*

More information beyond two-point statistics can be inferred using the cumulants approach. This method can extract such higher-order information directly from the existing data, rather than from the TI. Dimitrakopoulos et al. (2010) first used this method to simulate geological structures. The geological process, anisotropy and pattern redundancy are the important factors that should be considered in selecting the necessary cumulants (Mustapha and Dimitrakopoulos 2010). The conditional probability is first calculated based on the available data. The TI is then searched only if sufficient replicates cannot be found in the data. Appropriate spatial cumulants must be selected for each geological scenario, and there is no specific strategy for this. Some of the results of this method are shown in Fig. 30.10.

**Fig. 30.9** The results of the DS algorithm for modeling of a hydraulic conductivity field (upper row) and a continuous property (**b**). The results are taken from Mariethoz et al. (2010) and Rezaee et al. (2013)

### *30.5.2 Pattern-Based Algorithms*

Pixel-based algorithms can have difficulty preserving the continuity of geological structures. To alleviate this, some pattern-based methods have been developed, which are briefly introduced below. Their commonality is that they do not simulate one pixel at a time but paste an entire "patch" into the simulation. One of the main advantages of pattern-based simulation methods is their ability to preserve the continuity and overall structure observed in the TI.

### i. *Simulation of Pattern (SIMPAT)*

The simulation of patterns algorithm was first introduced to address some of the limitations of the SNESIM algorithm, namely the CPU time and the connectivity of patterns (Arpat and Caers 2007). This method replaces the probability with a distance for finding the most similar pattern. The algorithm can be summarized as follows.

**Fig. 30.10** The results of cumulants for modeling of two complex channelized systems. The results are taken from Mustapha and Dimitrakopoulos (2010, 2011)

The TI is first scanned using a predefined template *T* and all the extracted patterns are stored in a pattern database. Then the simulation points are visited along the given random path and the corresponding data-event is extracted at each of them. If the data-event at the visiting point contains no data, one of the patterns in the pattern database is selected randomly. Otherwise, the most similar pattern is selected based on the similarity between the data-event and the patterns in the pattern database. The above steps are repeated for all visiting points. The results of the SIMPAT algorithm are realistic. However, it requires extensive CPU time and encounters serious issues in 3D modeling. Furthermore, the produced results show considerable similarity with the TI, as this algorithm seeks the best matching pattern. Thus, this method seems to underestimate the spatial uncertainty. Some of the results of SIMPAT are shown in Fig. 30.11.
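The database construction and the pattern selection summarized above can be sketched as follows. The square template, the use of `np.nan` to mark uninformed nodes and the Manhattan distance are assumptions for this illustration; the function names are hypothetical.

```python
import numpy as np

def build_pattern_database(ti, size):
    """Extract every size x size pattern of the TI (one scan)."""
    rows, cols = ti.shape
    return np.array([ti[y:y + size, x:x + size]
                     for y in range(rows - size + 1)
                     for x in range(cols - size + 1)])

def most_similar_pattern(patterns, data_event, rng):
    """Pick the database pattern closest to the data event.

    data_event has the template shape, with np.nan at uninformed nodes.
    """
    known = ~np.isnan(data_event)
    if not known.any():
        # Empty data event: any pattern may be pasted.
        return patterns[rng.integers(len(patterns))]
    # Assumed Manhattan distance, restricted to the informed nodes.
    distances = np.abs(patterns[:, known] - data_event[known]).sum(axis=1)
    return patterns[np.argmin(distances)]

ti = np.arange(16, dtype=float).reshape(4, 4)    # toy continuous TI
patterns = build_pattern_database(ti, size=2)
event = np.array([[5.0, np.nan], [np.nan, 10.0]])
best = most_similar_pattern(patterns, event, np.random.default_rng(0))
```

Always taking the single best match, as in `np.argmin` above, is what tends to make the realizations resemble the TI closely; the distance to every pattern in the database must also be evaluated at each visiting point, which explains the CPU cost.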

**Fig. 30.11** The results of SIMPAT. These results are taken from Arpat (2005)

In a similar fashion, the pattern-based techniques can be used within a Bayesian framework (Abdollahifard and Faez 2013). This process, however, can be very CPU demanding. Other enhancements of SIMPAT were also considered later by incorporating wavelet decomposition (Gardet et al. 2016).

### ii. *Filter*-*based Simulation (FILTERSIM)*

As pointed out, SIMPAT suffers from its computational cost, as it requires calculating the distance between the data-event and all the patterns in the pattern database. One possible solution is to summarize both the TI and the data-event. Zhang et al. (2006) proposed a new method, FILTERSIM, in which various filters (6 and 9 filters in 2D and 3D, respectively) are used to reduce the spatial complexity and dimensions, and hence the computation time. The patterns are first filtered using the pre-defined linear filters. Then, the outputs are clustered based on the similarity of the filtered patterns. Next, a prototype pattern is computed for each cluster that represents the average of all the patterns in the cluster. Afterwards, similar to SIMPAT, the most similar prototype is identified using a distance function and one of the patterns in its cluster is selected randomly. These steps are repeated until the simulation grid is filled. Because it uses a limited number of filters, this algorithm requires less computational time than SIMPAT. The distance function in FILTERSIM was later replaced with a wavelet-based one (Gloaguen and Dimitrakopoulos 2009). The drawback of the wavelet-based method is that it has many parameters (e.g. the wavelet decomposition level) that can affect both quality and CPU time. Such parameters require extensive tuning in order to achieve good results in a reasonable time.
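The filtering, clustering and prototype steps of the filter-based approach can be illustrated with the sketch below. It is a simplified stand-in, not the published algorithm: a single averaging filter replaces the 6 (2D) filters, equal-width binning of the filter scores replaces the clustering method, and the Manhattan distance to the prototypes is an assumed choice. All function names are hypothetical.

```python
import numpy as np

def filter_scores(patterns, filters):
    """Weighted sum of each pattern with each linear filter, shape (N, F)."""
    return np.tensordot(patterns, filters, axes=([1, 2], [1, 2]))

def cluster_patterns(scores, n_bins):
    """Group patterns whose scores fall in the same equal-width bins."""
    lo, hi = scores.min(axis=0), scores.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / n_bins, 1.0)
    bins = np.minimum(((scores - lo) / width).astype(int), n_bins - 1)
    _, labels = np.unique(bins, axis=0, return_inverse=True)
    return labels.ravel()

def prototypes(patterns, labels):
    """Point-wise average of the patterns in each cluster."""
    return np.array([patterns[labels == c].mean(axis=0)
                     for c in range(labels.max() + 1)])

def pick_pattern(patterns, labels, protos, data_event, rng):
    """Find the closest prototype, then draw one pattern of its cluster."""
    known = ~np.isnan(data_event)
    dist = np.abs(protos[:, known] - data_event[known]).sum(axis=1)
    members = np.flatnonzero(labels == np.argmin(dist))
    return patterns[rng.choice(members)]

# Toy database with two kinds of 3 x 3 patterns and one averaging filter.
patterns = np.array([np.zeros((3, 3)), np.ones((3, 3)),
                     np.zeros((3, 3)), np.ones((3, 3))])
filters = np.full((1, 3, 3), 1.0 / 9.0)
labels = cluster_patterns(filter_scores(patterns, filters), n_bins=2)
protos = prototypes(patterns, labels)
event = np.full((3, 3), np.nan)
event[1, 1] = 1.0
picked = pick_pattern(patterns, labels, protos, event,
                      np.random.default_rng(0))
```

The data event is compared with a handful of prototypes rather than with every pattern in the database, which is where the saving in computation time relative to SIMPAT comes from.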

In a similar fashion, Eskandari and Srinivasan (2010) proposed Growthsim to integrate dynamic data into the simulation. This method begins at the data locations and grows gradually until the simulation grid is complete.

The most important shortcoming of FILTERSIM is that it uses a limited set of linear filters that cannot always convey all the information and variability in the TI. Moreover, selecting the appropriate filters and several user-dependent parameters for each TI is an issue that is common among many MPS methods. Some of the realizations generated using the FILTERSIM algorithm are shown in Fig. 30.12.

### iii. *Cross*-*Correlation based Simulation (CCSIM)*

**Fig. 30.12** The results of FILTERSIM. These results are taken from Zhang et al. (2006)

One of the recent MPS algorithms is the cross-correlation-based simulation (CCSIM) algorithm, which utilizes a cross-correlation function (CCF) along a 1D raster path (Tahmasebi et al. 2012a). The CCF, which represents a multi-point characteristic function, is used to match the patterns in a realization with those in the TI. This algorithm has been adapted for different scenarios and computational grids. For example, multi-scale CCSIM (Tahmasebi et al. 2014) can be used when the simulation grid is very large. This algorithm, similar to the SNESIM algorithm, is also based on calculating the joint probability. The SNESIM algorithm calculates the probability using a search-tree algorithm. However, calculating the above conditional probability for every single point in the simulation grid, in the presence of a large 2D/3D TI, is computationally prohibitive unless a very small neighborhood is used, which leads to poorly connected features in the resulting realizations.

In the CCSIM algorithm, the above limitation is addressed differently. First, the probability function is replaced with a similarity function, called the cross-correlation function (CCF), which is much more efficient to evaluate than the probability. Secondly, based on Markov chain theory, the CCSIM algorithm uses a similar search template (i.e. radius). However, unlike the previous algorithms, where all the data in the search template are used for simulating the visiting point, only a small data event located at the boundaries, called the overlap region OL, is considered in the calculations. Furthermore, except for the points falling in the OL region, the simulated points are removed from the list of visiting points and are not simulated again. Thus, instead of simulating every single point, this algorithm skips some of them and partitions the grid into several blocks. The CCF can be calculated as follows:

$$\mathbf{C}_{\mathbf{TI},\mathbf{D}_T}(i,j) = \sum_{x=0}^{D_x - 1} \sum_{y=0}^{D_y - 1} \mathbf{TI}(x+i, y+j)\, \mathbf{D}_T(x, y),\tag{30.13}$$

with *i* ∈ [0, *T<sub>x</sub>* + *D<sub>x</sub>* − 1), *j* ∈ [0, *T<sub>y</sub>* + *D<sub>y</sub>* − 1) and *i*, *j* ∈ ℤ. Here *i* and *j* represent the shift steps in the *x* and *y* directions. **TI**(*x*, *y*) represents the value at point (*x*, *y*) of the TI of size *L<sub>x</sub>* × *L<sub>y</sub>*, with *x* ∈ {0, …, *D<sub>x</sub>* − 1} and *y* ∈ {0, …, *D<sub>y</sub>* − 1}. An OL region of size *D<sub>x</sub>* × *D<sub>y</sub>* and a data event **D**<sub>*T*</sub> are used to match the pattern in the TI. *T* represents the size of the template used in CCSIM.
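Equation (30.13) can be evaluated directly with a short sketch. The version below only considers shifts for which the data event lies entirely inside the TI, and it indexes arrays as (row, column); both are assumptions of this illustration, as are the function name and the toy data.

```python
import numpy as np

def cross_correlation(ti, d_t):
    """Eq. (30.13) at every shift (i, j) where D_T fits inside the TI."""
    d_rows, d_cols = d_t.shape
    rows, cols = ti.shape
    ccf = np.empty((rows - d_rows + 1, cols - d_cols + 1))
    for i in range(ccf.shape[0]):
        for j in range(ccf.shape[1]):
            # Sum of products between the shifted TI window and D_T.
            ccf[i, j] = np.sum(ti[i:i + d_rows, j:j + d_cols] * d_t)
    return ccf

# Toy TI: zeros everywhere except for one distinctive 2 x 2 block.
ti = np.zeros((6, 6))
ti[2:4, 3:5] = [[1.0, 2.0], [3.0, 4.0]]
d_t = np.array([[1.0, 2.0], [3.0, 4.0]])      # data event from the OL region
ccf = cross_correlation(ti, d_t)
best_shift = np.unravel_index(np.argmax(ccf), ccf.shape)
```

Note that the raw sum of products is not normalized, so windows with large values can score highly even when their shape differs from the data event. The toy TI avoids this effect by construction.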

The CCSIM algorithm can realistically reproduce large-scale structures in very little computation time. This technique, however, does not fully match the well data, and some artefacts are generated around the conditioning point data. Recently, this technique has been used within an iterative framework along with boundary cutting methods, by which the efficiency and the reproduction of conditioning data have been improved significantly (Gardet et al. 2016; Kalantari and Abdollahifard 2016; Mahmud et al. 2014; Moura et al. 2017; Scheidt et al. 2015; Tahmasebi and Sahimi 2016a, b; Yang et al. 2016). Some of the results of CCSIM are shown in Fig. 30.13. Furthermore, this method has been successfully implemented for fine-scale modeling in digital rock physics (Karimpouli et al. 2017; Karimpouli and Tahmasebi 2015; Tahmasebi et al. 2016a, b, c, 2017a, b).

**Fig. 30.13** The results of the CCSIM algorithm

### *30.5.3 Hybrid Algorithms*

Each of the current MPS methods has some specific limitations. For example, the pixel-based techniques are good at honoring the conditioning point data, while they can barely produce long-range connectivity. Conversely, the pattern-based methods can produce such structures, but they are unable to preserve the hard data. Thus, a hybrid MPS method is attractive if it uses the strengths of both groups effectively. In the following, the available hybrid methods are reviewed and their advantages and disadvantages are discussed.

i. *Hybrid Sequential Gaussian/Indicator Simulation and TI*

Ortiz and Deutsch (2004), under an assumption of independence of the different data sources, integrated the indicator method with MPS. Hence, instead of using a TI, the MPS properties are obtained directly from the available hard data (variogram) and integrated with the results of indicator kriging. Finally, a value is drawn from this new distribution. These methods were further investigated by Ortiz and Emery (2005). However, in most cases, the initial results of indicator kriging strongly influence the final realization.

### ii. *Hybrid Pixel*- *and Pattern*-*based Simulation (HYPPS)*

The strengths of both the pixel- and pattern-based algorithms can be combined to make a hybrid algorithm. Tahmasebi (2017) combined these two types of algorithm and proposed a new hybrid algorithm, called HYPPS. This algorithm divides the simulation grid into regions with and without conditioning data. More attention must be paid to the locations containing well data, as acquiring such data involves considerable cost. Thus, the SNESIM algorithm, as a pixel-based method, can be used around such locations. Regardless of the type of geostatistical method, reproduction of patterns and of well data are the most important factors. Producing realistic models without taking into account the conditioning data, or vice versa, is not desirable. Any successful algorithm should be able to maintain both of the above criteria at the same time.

In the HYPPS algorithm the simulation grid is divided into two regions, and a different geostatistical method is applied to each of them. The HYPPS algorithm uses CCSIM, as a pattern-based algorithm, for the locations where no hard data (HD) exist and, similarly, the SNESIM algorithm around the well data, where it can precisely reproduce the conditioning data. Thus, the hybrid state of the pixel-based and pattern-based techniques can be written as follows:

$$\Phi(\mathbf{h}_1, \dots, \mathbf{h}_n; k_1, \dots, k_n) = E\left\{ \prod_{\alpha=1}^n I(\mathbf{u} + \mathbf{h}_\alpha; k_\alpha) \right\} + \prod_{\alpha=1}^n \Phi(\mathbf{h}_\alpha | \mathbf{h}_{\Phi_\alpha}) \tag{30.14}$$

which implies that the joint event over a template where both methods are working simultaneously can be expressed as the summation of the two probability distributions defined earlier (see the SNESIM algorithm). Thus, normalization terms, namely *n<sub>x</sub>* and *n<sub>p</sub>*, should be included such that *n<sub>x</sub>* + *n<sub>p</sub>* = 1. Note that *n<sub>x</sub>* and *n<sub>p</sub>* represent the normalized number (or percentage) of the simulated points used in the pixel- and pattern-based methods, respectively. An equivalent form of the above probability can be expressed as:

$$\Phi(\mathbf{h}_1, \ldots, \mathbf{h}_n; k_1, \ldots, k_n) \cong \frac{n_x \left( \dfrac{c(d_n)}{N_n} \right) + n_p \left( \sum\limits_{x=0}^{D_x - 1} \sum\limits_{y=0}^{D_y - 1} \mathbf{TI}(x+i, y+j)\, \mathbf{D}_T(x, y) \right)}{n_x + n_p} \tag{30.15}$$

The second term on the right-hand side of the above equation is used for the areas where the CCSIM algorithm is utilized, while the visiting points around the well data are evaluated jointly.
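As a small worked illustration of Eq. (30.15), the weighted combination can be written as a single function. The function name and the numerical values are hypothetical, and the sketch assumes that the CCF value has already been rescaled to lie between 0 and 1 so that the two terms are comparable; the source does not specify such a rescaling.

```python
def hypps_score(n_replicates, n_patterns, ccf_value, n_x, n_p):
    """Weighted mix of the pixel-based and pattern-based terms, Eq. (30.15)."""
    pixel_term = n_replicates / n_patterns      # c(d_n) / N_n, Eq. (30.12)
    return (n_x * pixel_term + n_p * ccf_value) / (n_x + n_p)

# 40% of the points handled by the pixel-based part, 60% by the pattern part.
score = hypps_score(n_replicates=5, n_patterns=20, ccf_value=0.75,
                    n_x=0.4, n_p=0.6)
```

With these values the pixel-based term is 0.25, so the combined score is 0.4 × 0.25 + 0.6 × 0.75 = 0.55. Setting `n_p` to zero recovers the SNESIM ratio alone, and setting `n_x` to zero leaves only the pattern-based term.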

**Fig. 30.14** The application of the SNESIM algorithm for simulating a grid when the borders are conditioned. **a** the TI, **b** boundary data, **c** generated realizations using the boundary data, **d** boundary data along with well data, **e** generated realizations using the boundary and well data. Note that the sizes of the TI and simulation grid in (**a**) and (**b**) are 250 × 250 and 100 × 100, respectively. The shale and sand facies are indicated by blue and red colors, respectively

It is worth mentioning that the co-template (Tahmasebi et al. 2014) can be used with CCSIM to give priority to the patterns that contain the conditioning data ahead of the raster path. Therefore, long-range connectivity structures are taken into account even in the blocks with no conditioning data. An example is provided in Fig. 30.13e. Although the density of conditioning data is increased compared to the previous scenario, the produced realizations still represent the real heterogeneity in the TI. The HYPPS algorithm can be used to integrate data at different scales as well (Fig. 30.14).

### **30.6 Current Challenges**

The MPS techniques have been developed extensively to deal with complex and more realistic problems in which the geology and other sources of data, such as well and secondary data, are reproduced. There are still some critical challenges in MPS that require more research. Some of them are listed below:


Many issues remain to be addressed. For example, the current MPS methods are designed for stationary TIs, whereas the properties of many large-scale porous media exhibit non-stationary features. Some progress has recently been made in this direction (Chugunova and Hu 2008; Honarkhah and Caers 2012; Tahmasebi and Sahimi 2015a). In addition, large uncertainties are associated with every TI. Thus, if several TIs are available, it is necessary to design methods that can determine which TI(s) to use in a given context.

### **References**



### **Chapter 31 When Should We Use Multiple-Point Geostatistics?**

**Gregoire Mariethoz**

**Abstract** Multiple-point geostatistics should be used when there is either too little or too much information available for other types of geostatistics.

### **31.1 Under-Informed Versus Over-Informed Models**

For a long time, the classical geostatistical framework required moderate amounts of knowledge. Too little knowledge (few hard data, poorly distributed, absence of auxiliary information), makes it difficult to infer the parameters of a covariance model. In the other extreme, too much knowledge risks revealing characteristics of the underlying field that are too complex to be represented by a handful of covariance model parameters. These two situations can be denoted respectively under-informed and over-informed models. In-between these extremes, we have the moderately informed case where it is convenient to use the covariance-based geostatistical framework, which has been—and still is—a very solid basis for building models that incorporate spatial and temporal variability.

Extreme under-informed and over-informed cases have often presented technical challenges, for which practical workarounds are used. For under-informed cases, standard geostatistical practice consists for example in including interpretative knowledge to guide variogram fitting when too few hard data are available. This is one of the reasons for the common recommendation to fit variograms by hand (e.g. Olea 1999). The question of designing spatial models for over-informed cases (i.e., when large amounts of data are available) is relatively recent, with the development of improved sensors and high-resolution numerical models that triggered the era of "big data".

The concept of multiple-point statistics (MPS) appeared in the early 1990s, initially as a means of overcoming extreme under-informed situations. The idea, at the time developed by Guardiano and Srivastava (1993) under the impetus and guidance of A. Journel, was to give the modeler improved tools to include interpretative knowledge in spatial models. The fundamental novelty of the MPS framework was to encapsulate in a training image the interpretative knowledge on the spatial structure of the modeled phenomenon. Since an image is an object most people are familiar with, it allows combining different types of expertise and data, in particular from people who are not familiar with geostatistics.

G. Mariethoz (✉)

Institute of Earth Surface Dynamics (IDYST), University of Lausanne, Lausanne, Switzerland e-mail: gregoire.mariethoz@unil.ch

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_31

This approach naturally leads to regarding hard data as only a tiny fraction of the information to include in a model, implying that data alone are not enough. Then, an important part of the modeling work resides in the design of the training image, which can be difficult as natural images are typically not sufficiently repetitive or stationary. Unsurprisingly, the first successful applications of MPS took place in fields where data are typically few, uncertain and expensive, such as reservoir modeling, soil science or mining. In those domains, MPS is often seen as an alternative to object-based methods. Later, it was found that the concept of training image could also be used to incorporate large amounts of information in a model, and therefore address over-informed and data-rich situations, where an increasing number of applications are taking place.

### **31.2 MPS Versus Covariance-Based Geostatistics**

These different aspects have resulted in MPS being seen as in opposition to covariance-based geostatistics. Indeed, from a traditional statistics point of view, MPS is not rigorous in many respects: for instance, there is no real model inference, the uncertainty that can be estimated based on a set of MPS realizations is poorly defined, and extreme events cannot be produced beyond those found in the training image. Emery and Lantuéjoul (2014) have shown, based on thorough numerical and theoretical investigations, that MPS only produces random fields when the size of the training image tends to infinity. With a finite training image, MPS algorithms no longer approximate a random function. Their value then lies in their capability to automatically generate realistic model realizations, but without control of the underlying statistical model. These issues make MPS methodologically close to machine learning and computer graphics. As a result, when using MPS, one often has to make compromises with random function theory and model consistency. In return, it may be possible to explore the data in new ways and obtain, in some cases, models that are more in line with the unobserved physical reality (Journel 1993).

While the hypotheses and tools used are very different, the domains of application of MPS are essentially the same as those of traditional geostatistics, consisting in the simulation of either conditional or unconditional random fields, mainly for geoscience applications. As such, MPS and covariance-based geostatistics can be seen as competing, and it is not very surprising that in the last decade there have been many cases of fierce debate between the promoters of these two competing approaches (Journel and Zhang 2006; Li et al. 2015). My view is that the two sets of methods should not be seen as opposed, but as complementary. They are complementary because they are able to solve different types of problems, which can be distinguished by the nature and amount of information at hand. Seeing the covariance-based and the algorithm-based approaches as opposed can distract from the higher goal of building on the strengths of each approach. The risk has been stated by Breiman (2001) on the topic of machine learning methods: "*statisticians have ruled themselves out of some of the most interesting and challenging statistical problems that have arisen out of the rapidly increasing ability of computers to store and manipulate data*".

When the available data and knowledge on the studied phenomenon allow building a random function model, using covariance-based geostatistics is usually appropriate. There are numerous examples of successful models designed in this framework for which it would be very difficult to apply MPS (e.g. Diggle et al. 1998; Goovaerts 2005). Conversely, there are applications where the use of training images and MPS algorithms is better able to address some practical questions. In the next sections, I will show two such examples where the available information is either extremely poor or extremely rich. Applying covariance-based geostatistics to these examples would likely yield unsatisfactory results. I emphasize here that, for the purpose of demonstration, I am exclusively focusing on examples that are tailored for the application of MPS. Countless examples can be found for which covariance models are perfectly applicable, but it is beyond the scope of this short chapter to show them here.

### **31.3 Examples for Which MPS Works Well**

### *31.3.1 MPS Can Be Used in Extreme Under-Informed Situations*

An example of an extremely under-informed model is the common problem of interpolating rainfall data over a given area based on a small number of rain gauges. Rainfall is an inherently intermittent and highly spatially variable process (Benoit and Mariethoz 2017). Moreover, in some cases rain gauge data can be of poor quality, and it is not uncommon to only have binary wet/dry information (as opposed to rainfall accumulation). An example of such a poor dataset is shown in Fig. 31.1, with synthetic rain observations consisting of 30 rain gauges. While this case is synthetic, the setting is relatively standard in terms of data density and heterogeneity. It is quite clear that 30 observation points are insufficient to properly infer a spatial model, which is confirmed by the experimental variogram that shows no spatial structure (and wild fluctuations when the number of lags is varied).
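The instability of an experimental variogram computed from so few points can be checked numerically. The following is a minimal sketch, assuming hypothetical gauge coordinates and wet/dry flags drawn at random (they are not the data of Fig. 31.1); the function name `indicator_variogram` and the choice of half the maximum distance as the largest lag are illustrative assumptions.

```python
import numpy as np

def indicator_variogram(coords, indicator, n_lags=10, max_dist=None):
    """Omnidirectional experimental indicator variogram.

    coords: (n, 2) array of gauge locations; indicator: (n,) array of 0/1.
    Returns lag centers, semivariance per lag, and number of pairs per lag.
    """
    coords = np.asarray(coords, dtype=float)
    ind = np.asarray(indicator, dtype=float)
    i, j = np.triu_indices(len(ind), k=1)           # all distinct pairs
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diff = (ind[i] - ind[j]) ** 2
    if max_dist is None:
        max_dist = dist.max() / 2.0                 # assumed rule of thumb
    edges = np.linspace(0.0, max_dist, n_lags + 1)
    lag_idx = np.digitize(dist, edges) - 1          # lag class of each pair
    gamma = np.full(n_lags, np.nan)                 # NaN where a lag is empty
    counts = np.zeros(n_lags, dtype=int)
    for k in range(n_lags):
        in_bin = lag_idx == k
        counts[k] = in_bin.sum()
        if counts[k] > 0:
            gamma[k] = 0.5 * sq_diff[in_bin].mean()
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, gamma, counts

# Hypothetical network: 30 gauges with binary wet/dry observations.
rng = np.random.default_rng(0)
gauges = rng.uniform(0, 400, size=(30, 2))
wet = rng.integers(0, 2, size=30)
centers, gamma, counts = indicator_variogram(gauges, wet, n_lags=10)
```

With 30 points there are only 435 pairs to distribute among the lag classes, so re-running the function with a different `n_lags` changes the number of pairs per class substantially, which is one way to reproduce the fluctuations mentioned above.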

In such a setting, the MPS approach starts by stating that the information contained in the hard data is insufficient. At best, the data points can be used for conditioning, but not for inferring any kind of structural model. Instead, one has to supplement the insufficient data by resorting to external knowledge of the modeled process. For example, one may know the type of rainfall for that specific day. Based on this knowledge, it is possible to collect radar images of rain events of the same type. Rainfall radar images, either ground-based or satellite-based, are typically collected by national weather agencies and made available to the scientific community. Then, using these representative radar images as training images, MPS can be used to generate rain fields conditioned to the gauge data.

**Fig. 31.1** Under-informed setting. Left: synthetic rain gauge network made of 30 points with only wet/dry information. Right: experimental omnidirectional indicator variogram of the probability of rainfall, computed on 10 lags

Figure 31.2 shows the results of using two different training images to interpolate the data shown in Fig. 31.1, by considering as training image alternatively a cyclone (left) or a tropical storm (right). It is obvious here that the choice of the training image has a strong influence on the results as it determines the types of patterns found in the simulations, as well as global statistics such as the proportion of wet areas.

This example illustrates the conceptual differences between MPS and covariance-based geostatistics. These differences extend beyond the formalism or the algorithms used. While classical geostatistics infers a model based on data, MPS generates additional data based on external knowledge, in this case through the search for and the selection of an appropriate radar image.

### *31.3.2 MPS Can Be Used in Extreme Over-Informed Situations*

The most common situation in geostatistics is to have a handful of data points, and based on these, to estimate the target variable on a large grid. Increasingly in recent years, the opposite situation occurs, with a large number of data used to predict the value at a smaller set of locations. One prime example of such over-informed problems is the application to satellite imagery, which typically consists of large spatial datasets (typically the entire Earth is covered at high spatial resolution) that also present a temporal aspect since the same location is imaged at regular intervals. Here we look at the Landsat 7 ETM+ sensor, which has the characteristic that it partially failed in 2003, and since then the images it acquires present gaps (as shown in Fig. 31.3a). The goal here is to fill these gaps with simulated values. In such an image, the regions to reconstruct typically represent about 20% of the domain, the rest consisting of conditioning data. These data contain not only local information, but also very rich structural information such as the type of land surface features (fields, forests, cities), the connectivity of the different objects (roads, water bodies), and their spatial arrangement (see details shown in Fig. 31.3c, e).

**Fig. 31.2** Application of MPS for rain occurrence simulation. Left: simulation of binary rainfall based on a training image of a cyclone. Right: same setting based on a training image of a tropical storm. Size of training images: 572 × 584 pixels. Size of simulation grid: 400 × 400 pixels. The Direct Sampling MPS algorithm was used

**Fig. 31.3** MPS applied to gap-filling of a 5-band Landsat 7 image. Scene acquired on March 22, 2017 in Western Switzerland. Image size: 500 × 500 pixels. The Direct Sampling MPS algorithm was used. Image shown in natural colors

The application of covariance-based geostatistics is in this case difficult, not because of challenges related to model inference and identification (as in Fig. 31.1), but because standard simulation techniques, such as Sequential Gaussian Simulation or turning bands, will likely result in artifacts that are clearly visible to the eye. Indeed, the complex land surface information cannot be entirely represented by covariance models which are typically represented by a small number of parameters. Furthermore, although interpolation artifacts are sometimes obvious to the eye, they are typically undetectable by standard statistical metrics because these metrics are based on covariance (or two-point statistics) and cannot identify complex patterns such as connectivity, for which the human eye is very well suited. It can of course be argued that there are applications where these complex properties do not matter; but if they do, the covariance-based framework is inappropriate (Zinn and Harvey 2003).

In contrast, applying MPS to this gap-filling problem is straightforward. The MPS approach used here for the simulation of gaps is the one presented by Yin et al. (2017a, b). Each color channel is co-simulated and no auxiliary variables are used. Contrary to the data-poor case, there is no need here to infer, construct or hypothesize a training image. The training image is given by the 80% of the domain that is known. While the training image size is far from infinity, it is a little closer to the ideal situation outlined by Emery and Lantuéjoul (2014). The gap-filling results (Fig. 31.3b, d, f) present very few visual artifacts. In certain places, it is possible to see that some reconstructed elongated features are discontinuous (e.g. the road near the center of Fig. 31.3d). However, in most cases it is difficult to distinguish the reconstructed and the original areas (e.g. in Fig. 31.3f).

### **31.4 Conclusion**

Often the debate around MPS and covariance-based approaches has been centered on the dichotomy between multiGaussianity or non-multiGaussianity of the variable to simulate (Gómez-Hernández and Wen 1998). The choice of a simulation approach or algorithm should certainly be driven by the nature of the variable of interest: is it non-multiGaussian? is it non-stationary? is it channelized? do these characteristics matter for a given problem? I argue here that the question of the amount of information at hand is also a critical factor to consider when choosing which simulation framework to use, and this question has often been overlooked. It may make sense to also base this choice on the quantity of information available: do I have a conceptual model? do I have enough hard or soft data to infer a covariance? do I have so much data that I am able to detect non-multiGaussian behavior?

To summarize, one can say that different tools are available, and those should be chosen according to the problem to be solved. While no example with a moderate amount of information has been shown in this chapter, it is understood that this is generally the realm of covariance-based geostatistics. Under-informed situations are always going to be difficult because there are important modelling choices to make. For over-informed cases, relatively few assumptions are needed and, with some precautions, it can be possible to rely on algorithms such as MPS.

Better defining the role of MPS in the galaxy of existing spatial modeling tools can potentially help narrow the areas where future MPS research should focus. So far, there has been a strong emphasis on the development of simulation algorithms. The different algorithms available can reproduce spatial features with various degrees of faithfulness; they may need different computing resources or may offer specific options. While developments in MPS are still needed (in particular regarding training image selection and manipulation, as well as parametrization), the simulation algorithms are becoming quite mature. Moving beyond the dichotomy between covariance-based geostatistics and MPS can enable the development of new hybrid approaches. For example, using distance-based (also known as convolution-based) MPS algorithms can be seen as bootstrapping the training image. However, the link with bootstrapping theory (e.g. Davison and Hinkley 1997) has not yet been fully explored. Similarly, the MPS framework is currently unable to simulate extreme values. Combining MPS with more standard statistical approaches may open new fields of applications, in particular in domains such as climate science, hydrology or earth surface observation where increasingly rich space-time datasets are now available.

### **References**


Li L, Romary T, Caers J (2015) Universal kriging with training images. Spat Stat 14:240–268

Olea RA (1999) Geostatistics for engineers and earth scientists. Springer, New York



### **Chapter 32 The Origins of the Multiple-Point Statistics (MPS) Algorithm**

**R. Mohan Srivastava**

**Abstract** First proposed in the early 1990s, the geostatistical algorithm known as multiple-point statistics (MPS) now enjoys widespread use, particularly in petroleum studies. It has become part of the toolkit that new practitioners are trained to use in several oil companies; it has been incorporated into commercial software; and research programs in many universities continue to tap into the central MPS idea of extracting statistical information directly from a training image. The inspiration for the development of a proof-of-concept MPS prototype code owes much to several different researchers and research programs in the late 1980s and early 1990s: the sequential algorithms pioneered at Stanford University, the work of Chris Farmer, then at UK Atomic Energy, and the growing use of outcrop studies by several oil companies. This largely accidental confluence of divergent theoretical perspectives, and of distinct practical workflows, serves as an example of how science often advances through the intersection of ideas that are not only disparate but even contradictory.

**Keywords** MPS ⋅ Multiple-point statistics ⋅ Conditional simulation ⋅ Training image

### **32.1 Introduction**

Through the windows of the cottage, we watched the sun slip behind the trees on the ridge across the lake, turning the light dusting of snow from pink to red to crimson. As darkness settled outside, the windows became mirrors, lit by the flame from the logs in the fireplace, until all we could see was our two reflections, each resting comfortably in an armchair, wine glass in hand. We talked into the late evening, past the rising of the crescent moon, reminiscing about people, about ideas, about where it all began. We'd known each other for more than three decades, and were comfortable when talk lapsed into silence … and equally comfortable when silence gave way to a new thought, a different recollection, and the conversation flared up into a dispute about memory, about theory or about practice. Even when the wine bottle stood empty, and the embers in the fireplace seemed to be exhausted, the logs would sometimes adjust, one breaking and settling against another as sparks shot into the air. New fire from old.

R. Mohan Srivastava (✉)

TriStar Gold Inc., #1100 – 120 Eglinton Avenue East, Toronto, ON M4P 1E2, Canada e-mail: MoSrivastava@tristargold.com

<sup>©</sup> The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_32

It was December 2013, and I had succeeded in having my old advisor from Stanford, André Journel, visit me in Ontario to discuss a joint contribution to the volume on multiple-point statistics being compiled by Grégoire Mariethoz and Philippe Renard. Busy lives kept us from completing that task, but the conversation from that weekend by Lake Muskoka did become enough of an almost-paper that I was grateful for the opportunity of this 50th anniversary volume to complete what we began. Neither André nor I have much to contribute to modern MPS research; we are both "gray hairs" and now stand well back from the fire of leading-edge research. But our hair was once not so gray, and we were there at the beginning when we laid the kindling for what has become a remarkably rich idea. So our offering from that Lake Muskoka discussion is reflections on how the MPS came together. It is a tale familiar to science, with chance encounters, casual remarks that turn out to have great depth, cocktail napkins turned into whiteboards, heads shaking in disagreement: "that can't be right". As we yield the stage to the next generations of researchers, our hope is that others continue to recognize the value of cross-pollination, of interacting with others in the field, especially those who have ideas that contradict one's own beliefs. When one sturdy idea burns and breaks, settling against another, sparks fly and we have our best chance to ignite new understandings of both theory and practice.

### **32.2 1970s**

### *32.2.1 A Hammer Without a Nail*

Although the theory of geostatistical simulation was firmly established by the early 1970s (Journel 1974), it had still not been widely accepted in practice by the end of the decade. The now-venerable turning bands algorithm was the only game in town when one wanted to create a conditional simulation. There were a handful of practical case study examples of conditional simulation in the mining industry, but it remained a hard sell in an industry that prefers, even now, to report one single "best" estimate of mineral resources and reserves rather than to wrestle with a family of equi-probable outcomes.

### **32.3 1980s**

### *32.3.1 Interest in Geostatistics Spreads to the Oil Industry*

Through the 1970s, the oil industry lagged behind the mining industry as an adopter of geostatistics. Many oil companies found value in some of the trend variants of kriging as additional tools in their contouring toolkit, especially when dealing with structural traps where trends are common in the elevation of the top of structure. Kriging with an external drift provided a good contouring solution in faulted reservoirs where seismic data provided strongly correlated indirect measurements of depth to the top of the reservoir (Maréchal 1984). But oil companies had many good contouring methods that worked well without any geostatistics, and it was not until the late 1980s that most of the major oil companies took notice of conditional simulation because it offered something new: the ability to do Monte Carlo analysis with 3D models of a reservoir's rock and fluid properties that honored data and that were geologically plausible.

### *32.3.2 New Simulation Tools and the Struggle for Visual Realism*

At Stanford, where I studied in the 1980s, research was supported by the Stanford Center for Reservoir Forecasting. The SCRF consortium's interest in risk analysis fueled a growing number of new geostatistical algorithms for creating realizations that honored continuous data (typically rock and fluid properties) and categorical data (typically lithologies): sequential indicator simulation (Alabert 1987), LU decomposition (Alabert 1987; Davis 1987), and sequential Gaussian simulation (Isaaks 1990; Gómez-Hernández 1991).

Despite having new algorithms for the conditional simulation of continuous variables, Stanford's toolkit still struggled to produce convincing simulations of categorical variables such as lithologies in a sand-shale sequence. Although indicator realizations could be made to honor indicator variogram models, the results usually were not convincing as artwork; they simply looked wrong. In Fig. 32.1, much of the (limited) success of the SIS simulation is due to the use of a trend model and to locally varying directions of maximum continuity, and not so much to the indicator kriging or to sequential simulation.

Boolean simulations that stochastically arranged prescribed geometries into a computer model usually won more approval for realism, but because these object-based algorithms were not pixel-based, they had difficulty with conditioning to well data, especially if there were lots of closely-spaced wells. In Fig. 32.1, the SIS realization is conditioned to 240 data points; but the object-based simulation, which produces a more satisfying result, is unconditional.

Through my time as a graduate student at Stanford, the Holy Grail of conditional simulation was a best-of-both-worlds algorithm that had the visual realism of object-based methods but that conditioned easily to hard data, no matter how dense. There were discussions at that time about the possibility that we might never achieve what we thought we wanted because of the fundamental difference between the statistical characteristics of an image and the meaning that knowledgeable experts extract from the image. In the example in Fig. 32.1, human vision allows us to see the entire set of meandering braided channels. Statistical summaries, especially variograms, do not "see" anything in its entirety; they see the image two points at a time. The analogy that André Journel often used was that it was like a blind person, trying to understand an object in front of him when he was allowed only to probe with the two forefingers. Limited to poking here and poking there, the blind person would struggle to tell the difference between an elephant and a rhinoceros.

**Fig. 32.1** Examples of indicator simulation and object-based simulation of fluvial channels. The image at the top shows a training image (a satellite image of the Brahmaputra River) from which indicator variograms were calculated and used to create the SIS realization in the middle frame, conditioned to the data shown as circles. The same training image provides information on the distribution of parameters that describe object geometry; these were used as input to an object-based simulator, FLUVSIM (Deutsch and Tran 2002), to create the unconditional realization at the bottom. Although the object-based simulation succeeds in creating channels that are visually more coherent, it is difficult to condition to known lithologies at specific locations

The envy of the visual success of object-based realizations, and the desire to maintain the ease of conditioning with pixel-based methods, catalyzed a lot of discussion in the late 1980s about multi-point geostatistics. What would three-point or four-point or n-point variograms look like? How might they be calculated experimentally? How could they be modeled? How should they be used in an improved version of kriging?

### *32.3.3 Outcrops and Scanned Images as Analogs*

In the mining industry, where geostatistics was first embraced, drill hole spacing is typically on the order of tens of meters, close enough that the choice of a variogram model could be based on experimental variograms. In petroleum reservoirs, wells are typically spaced several hundreds of meters apart, sometimes thousands of meters. This practical reality of petroleum applications gave rise to an immediate practical problem when the oil industry took an interest in conditional simulation in the 1980s: where to get the closely-spaced information required to make experimental variograms?

The common advice in the 1980s was that outcrop studies could provide the data required to support statistical and geostatistical parameter choices, such as the length, anisotropy and orientation distributions required for object-based methods, or the variograms required for geostatistical methods. Outcrop studies did not begin in the 1980s; but this was the decade when they flourished. Many of the major oil companies, either individually or in consortiums, funded detailed quantitative studies of outcrops that could serve as good geological analogs for producing fields. And outcrop studies from earlier decades were dusted off and re-purposed as sources for data that could support parameterization of computer models.

Figure 32.2 shows an example of data from a 1960s outcrop study that was re-discovered by several researchers in the 1980s. It was created by digitizing shale streaks from a photograph of a cliff face of an outcrop of the Assakao Formation in the Tassili region of the central Sahara (Dupuy and du Prey 1968). Fifteen years after the data was first presented, Helge Haldorsen used the Assakao outcrop study as the basis for choosing the shale length distribution for object-based simulations of sand-shale sequences for his Ph.D. research (Haldorsen 1983; Haldorsen and Chang 1985). During the time when I studied at Stanford, I shared an office with Alec Desbarats to whom Helge had given the Assakao data for Alec's research on stochastic modeling of flow in sand-shale sequences (Desbarats 1987).

**Fig. 32.2** The Assakao Sandstone data set (from Desbarats 1987). The formation is generally sandstone (white) with occasional shale streaks (black)

If a good outcrop analog was not available, one could (with fingers crossed and a prayer for absolution of sin) invoke a fractal argument and choose as an analog something with an entirely different scale. At a much larger scale than most reservoirs, satellite imagery, which started to become widely available in the 1980s, could serve as the source of information on spatial statistics. At the regional scale, or even at the scale of very large reservoirs, images like the top frame in Fig. 32.1 could help in sorting out statistical parameters for numerical simulation. And at a much smaller scale, there were scanned images of slabs of sedimentary rock at the scale of hand specimens, such as the example shown in Fig. 32.3.

Digitized images, whether of outcrops or of similar phenomena at different scales, provide a basis for calculating not only experimental variograms but also multi-point statistics. When calculated from a rasterized image, the length distribution of shale streaks can be seen as a multi-point statistic. In the Assakao outcrop example shown in Fig. 32.2, where the individual pixels are 20 × 20 cm, the probability of encountering a shale streak that is 20 m long can be calculated by scanning the image across each row, counting up the number of times we get a white pixel followed by 100 black pixels, then followed by a white pixel … then dividing this by the total number of shales of any length. Alec Desbarats did exactly this in his Ph.D. thesis when he wanted to test the fidelity of the synthetic sand-shale sequences he had created using indicator simulation (Fig. 32.4). He knew he had the correct proportion of shales and that he had matched the indicator variogram; but he was curious about how well he had done on the multi-point statistic that Helge Haldorsen controlled directly in his simulations. Figure 32.5 shows the histograms of the shale length distributions from an indicator simulation of the Assakao outcrop, and from the original image; the indicator simulation shows more very short shales than does the original image, with a lower mean length and higher variance.

**Fig. 32.3** Digital image of a slab of cross-bedded sandstone from Utah

**Fig. 32.4** Indicator simulation of the Assakao outcrop image in Fig. 32.2 (from Desbarats 1987)

**Fig. 32.5** Histograms of shale lengths from Fig. 32.2 (left) and Fig. 32.4 (right) (from Desbarats 1987)

Other similar studies at the same time by François Alabert showed the same result: indicator simulation produces realizations that show more short features and too few long features. The over-representation of short features is also obvious from a visual comparison of indicator simulations to the reality they try to mimic, e.g. the top two frames in Fig. 32.1, or the realization in Fig. 32.4 with the outcrop image in Fig. 32.2. The common explanation given at the time was that when an algorithm controls only the first and second-order moments (histogram, or indicator proportion, and the variogram) then the uncontrolled higher-order moments drift in the direction of disorder or maximum entropy.
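The row-by-row count of shale lengths described above can be sketched in a few lines. This is a minimal illustration on a tiny made-up raster, not the Assakao data; the function name `shale_run_lengths`, the coding of shale as 1 and sandstone as 0, and the decision to treat the image border as sandstone (so that runs touching the edge are counted) are assumptions.

```python
import numpy as np

def shale_run_lengths(image, shale=1):
    """Lengths (in pixels) of horizontal runs of shale pixels, row by row."""
    lengths = []
    for row in np.asarray(image):
        is_shale = (row == shale).astype(int)
        # Pad with sandstone so that every run has a start and an end.
        padded = np.concatenate(([0], is_shale, [0]))
        change = np.diff(padded)
        starts = np.flatnonzero(change == 1)    # sandstone -> shale
        ends = np.flatnonzero(change == -1)     # shale -> sandstone
        lengths.extend(ends - starts)
    return np.array(lengths, dtype=int)

# Tiny hypothetical raster: 1 = shale (black), 0 = sandstone (white).
image = np.array([[0, 1, 1, 0, 1],
                  [1, 1, 1, 0, 0],
                  [0, 0, 0, 0, 0]])
lengths = shale_run_lengths(image)              # runs of 2, 1 and 3 pixels
pixel_size_m = 0.2                              # 20 cm pixels, as in Fig. 32.2
lengths_m = lengths * pixel_size_m
prob_two_pixels = np.mean(lengths == 2)         # share of runs that are 2 px long
```

Applying the same function to an outcrop image and to a simulated image, then comparing the two arrays of lengths, gives the kind of histogram comparison shown in Fig. 32.5.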

### *32.3.4 Leaving the Ivory Tower and Getting on with Adult Life*

My years as a student at Stanford ended in 1988. Sold my bicycle, the one that hadn't been stolen. Gave up the wonderful room I had in a camping trailer behind a house in Palo Alto. Headed off into the world of consulting, with Neil Schofield and Roland Froidevaux as my partners in FSS International Consultants. The notion was simple: Neil and I were familiar with student poverty and didn't mind another year of living with little money. After a year, if we failed as consultants then we could get real jobs.

We managed not to fail, and each of the FSS partners found ourselves busy with clients who wanted advice and assistance with geostatistical studies. My workload was split between mining studies, where simulation was rarely discussed, and petroleum studies, where kriging was rarely discussed.

Even though my mining studies had little to do with stochastic modeling, there was one mining project that, in hindsight, probably planted some useful seeds for what later became the MPS prototype algorithm. It was a project in which some of the useful geological and numerical data were available only from paper records written by hand decades ago: drill logs with assay values transcribed manually. In the late 1980s, software for optical character recognition (OCR) struggled with handwriting; it still does today, but it was worse back then. Even though commercial OCR software could make no sense of the handwritten logs, my sense was that it should be possible to extract much of it automatically, instead of going through a time-consuming and error-prone process of manual data entry. The drill logs were neat and legible, and all of the key numerical values were written in boxes on a form. With only 11 possible characters in use, the ten digits and the decimal point, it seemed possible to me that the handwriting could be recognized by an algorithm that trained itself from actual images. I wrote a program that would search the scanned image (an eight-level grayscale raster), looking for islands of non-white in the appropriate boxes on the form. It would then show what it had found to the user, who would identify the symbol by typing in one of the 11 choices. After a few dozen examples of each of the 11 possibilities, the software was able to estimate the probability that a new small patch corresponded to each of the possibilities. It did this simply by direct pixel-to-pixel matching of grayscale levels, without any clever rescaling or rotation. If it could not establish a sufficiently high probability for one particular choice, it would drop pixels from the comparison and try again. The user would correct it when it made mistakes, and the software would store its acquired collection of confirmed examples in a growing database. 
As with most of my Mo-code, it took a bit of tinkering to get it to work well; but it ended up being used, and saved weeks of data entry from hundreds of old drill logs. We ended up calling the program "Am-I-Right" because that's how the program worked: by making guesses based on pixel-to-pixel pattern matching, and then checking with the user to see if that guess was correct.
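
The matching loop described above can be sketched roughly as follows. This is not the original Am-I-Right code: the function name `guess_symbol`, the 0.8 confidence cut-off, the use of exact matches only, and the rule of dropping the outermost pixels first are all assumptions made for illustration.

```python
import numpy as np

def guess_symbol(patch, examples, labels, min_prob=0.8):
    """Guess which symbol a patch shows by exact pixel-to-pixel matching.

    examples: confirmed patches with the same shape as `patch` (at least one);
    labels: the symbols the user assigned to them.
    Assumption: when no symbol is probable enough, pixels are dropped from the
    comparison one at a time, farthest from the patch centre first.
    """
    ny, nx = patch.shape
    yy, xx = np.mgrid[0:ny, 0:nx]
    dist2 = (yy - (ny - 1) / 2) ** 2 + (xx - (nx - 1) / 2) ** 2
    # Pixel indices ordered from the centre outwards; the last one is dropped first.
    keep = np.argsort(dist2.ravel(), kind="stable")
    flat = patch.ravel()
    stack = np.array([e.ravel() for e in examples])
    labels = np.array(labels)
    while True:
        matches = np.all(stack[:, keep] == flat[keep], axis=1)
        if matches.any():
            symbols, counts = np.unique(labels[matches], return_counts=True)
            probs = counts / counts.sum()
            best = int(np.argmax(probs))
            if probs[best] >= min_prob or keep.size == 0:
                return str(symbols[best]), float(probs[best])
        keep = keep[:-1]  # not confident enough: compare on fewer pixels

# Two hypothetical 3 x 3 binary examples and a noisy "1"
one = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
zero = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]])
noisy_one = one.copy()
noisy_one[2, 2] = 1
guess = guess_symbol(noisy_one, [one, zero], ["1", "0"])
```

In the workflow described in the text, the user would then confirm or correct `guess`, and the confirmed patch would be appended to `examples`.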

### *32.3.5 Chris Farmer's Unexpected Claim*

1988 was also the year when I first met Chris Farmer, at the SPE Forum on reservoir characterization in Grindelwald, Switzerland. He was working on methods for numerically simulating reservoir rocks, recognized the benefits of a pixel-based approach, and had developed new ideas about what information to extract from outcrop studies and scanned images of analogs (Farmer 1989). During my early years as a consultant, I managed to visit Chris at the UK Atomic Energy Authority's research centre at Winfrith. During this visit, he made a claim that seemed implausible … no, it actually seemed flat out wrong; but I was raised well by my parents, and knew that it was rude for a guest to precipitate an argument.

We had been talking about extracting indicator variogram and cross-variogram information from scanned images and Chris remarked that you have to be careful when you do this because if you try to make a realization exactly match all of the indicator variograms and cross-variograms from a scanned image, then you'll just get back the scanned image; and the purpose of creating realizations is not to exactly match one "true" image, but instead to sample a space of uncertainty that shares something in common with the original image. I checked if I understood him correctly: did he really mean that you can exactly … exactly … match an image just by reproducing its indicator variograms and cross-variograms? I knew (or thought I knew) that this wasn't true. Even with multiple indicators, all of the variograms and cross-variograms are still two-point statistics; you're still a blind person, feebly prodding either an elephant or a rhino.

Chris clarified that he did mean exactly, with one minor caveat: that you actually get two possible images which are 180° rotations of each other; you might end up with an upside-down elephant, but you'd easily be able to figure out that it wasn't a rhino. And he also explained that he meant that you match to the complete experimental indicator variograms for every possible separation distance and direction on the rasterized image. Even with these caveats, I still found his claim implausible, but I kept thinking about why he would be so sure about this.

The other reason it was not worth getting into the details of why Chris was confused was that I agreed with the basic point he was trying to make: the purpose of what we have now taken to calling a "training image" is not to match it, but instead to use it as a guide for selected spatial statistical characteristics. You want to match the statistics, while conditioning to data, not replicate one training image.

### **32.4 1990s**

### *32.4.1 Why Chris Farmer Was Right*

In 1991, the SPE Forum on reservoir characterization was held in Crested Butte, Colorado, and I had a chance to continue the discussion with Chris Farmer about indicator variograms and training images. When I explained, as diplomatically as I could, that I didn't think his claim was correct, he grabbed a nearby napkin, sketched a small grid, and colored in some pixels as black, white and gray. He agreed that I was right if we lived in a world of variogram models for random functions that are infinite in all directions. "But in the real world, things have edges," he explained patiently, "and this means there's only one pair of pixels in the original image that completely spans the diagonal". He went on to show how you can actually deduce the grayscale levels for the two corner pixels (up to the 180° rotation) and then work inwards from the corners. The Appendix to this paper shows a small worked-out example of the trick that Chris explained.

As soon as he explained it, and I realized that I was the one who was wrong, Chris dismissed it as an algorithmic oddity, a cute and clever trick that has no practical value for simulating reservoir rocks, especially because the goal is never to exactly replicate the original image.

Even though I understood the principle behind the procedure of attacking the corners first and then working inwards, the algorithm still wasn't clear in my head, and I spent some time that year trying to write code for doing what Chris had described. I never did manage to work out all the special cases, and it ended up on the back burner as one more unfinished project.

### *32.4.2 Back to the Ivory Tower: A Brief Escape from Adult Life*

In late 1991, my consulting business was thriving and growing; I had a small staff in Vancouver, and plenty of project work. But I was spending more time as an administrator and manager, neither of which I am good at, and less time doing the technical work that I enjoy.

My old advisor convinced me that I could let the staff run the show while I spent a year at Stanford, back in the ebb and flow of new ideas with his new crop of graduate students. Twenty-five years later, I find it remarkable what was accomplished during that year: P-field simulation, co-located cokriging, and a proof-of-concept algorithm for multiple-point statistics. All of these new geostatistical methods that we investigated in 1992 began with a piece of Mo-code that did something useful, and not with theory; that came later. André comes at research from the side of theory that leads to equations that can be coded and tested. I tend to come at it the opposite way, with a piece of code that achieves a desired result and that then leads to the question "I wonder why that works?".

In the early part of 1992, with the luxury of time to do research again, I dusted off some of my back-burner projects, and came back to my attempt to code Chris Farmer's trick for replicating an image from its indicator variograms and cross-variograms. The details of the algorithm were still a mess, but I realized that I could get very close to a satisfactory result using simulated annealing, a possibility that came to the forefront because Clayton Deutsch was finishing his Ph.D. thesis on simulated annealing that year. I wrote a program that would start with a grid that had exactly the correct proportions of the gray levels, randomly scattered, and that would use simulated annealing to iteratively adjust the image by swapping pixels in order to push the experimental indicator variograms and cross-variograms of the evolving grid in the direction of target values established by the complete indicator variograms of the original image. No variogram models were used; everything was done using look-up tables of variogram values. I used a photo of André, exhausted after a climb on Mount Whitney, as the training image, and converted it to an eight-level grayscale image with seven indicators. 200 columns and rows, seven direct indicator variograms, 21 indicator cross-variograms, all calculated for every one of approximately 80,000 lags on the image. It took four days of run-time and hundreds of millions of swaps before the difference between the indicator variograms of the simulation and the image could not be reduced. It was hopelessly inefficient, but it confirmed for me that Chris Farmer had been right.
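
A much reduced sketch of this kind of annealing run is given below. It is not the 1992 program: it works on a hypothetical 4 × 4 three-level grid, matches only the direct indicator variograms (the cross-variograms are left out for brevity), recomputes the whole look-up table after every swap, and uses an arbitrary cooling schedule. The names `diff_counts` and `anneal` are made up for this illustration.

```python
import numpy as np

def diff_counts(img, thresholds):
    """Number of pairs with differing indicators, per threshold and per lag."""
    ny, nx = img.shape
    counts = []
    for t in thresholds:
        ind = img > t
        for dx in range(nx):
            for dy in range(-(ny - 1), ny):
                if dx == 0 and dy <= 0:
                    continue  # skip the zero lag and mirror-image lags
                y0, y1 = max(0, -dy), min(ny, ny - dy)
                tail = ind[y0:y1, :nx - dx]
                head = ind[y0 + dy:y1 + dy, dx:]
                counts.append(np.count_nonzero(tail != head))
    return np.array(counts)

def anneal(training, n_iter=2000, temp=2.0, cooling=0.998, seed=0):
    rng = np.random.default_rng(seed)
    shape = training.shape
    thresholds = np.unique(training)[:-1]        # one indicator per cut between levels
    target = diff_counts(training, thresholds)   # look-up table to be matched

    def mismatch(flat):
        return int(np.sum((diff_counts(flat.reshape(shape), thresholds) - target) ** 2))

    sim = rng.permutation(training.ravel())      # exact proportions, random positions
    cost = mismatch(sim)
    start, best, best_cost = cost, sim.copy(), cost
    for _ in range(n_iter):
        if best_cost == 0:
            break
        i, j = rng.integers(0, sim.size, size=2)
        if sim[i] == sim[j]:
            continue
        sim[i], sim[j] = sim[j], sim[i]          # propose a swap
        new = mismatch(sim)
        if new <= cost or rng.random() < np.exp((cost - new) / temp):
            cost = new
            if cost < best_cost:
                best, best_cost = sim.copy(), cost
        else:
            sim[i], sim[j] = sim[j], sim[i]      # reject: undo the swap
        temp *= cooling
    return best.reshape(shape), start, best_cost

# Hypothetical training image with three gray levels
training = np.array([[1, 1, 2, 3],
                     [1, 2, 3, 3],
                     [2, 3, 3, 2],
                     [3, 3, 2, 1]])
result, start_cost, final_cost = anneal(training)
```

Updating only the lags affected by each swap, rather than recomputing everything, is the obvious first step towards making such a loop less hopelessly inefficient.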

For me, the recognition that you can exactly match an image from a very complete and specific statistical summary of specific patterns was an eye-opener. Although it now probably seems fairly obvious, in the early 1990s, the wealth of information contained in an image's statistical summaries was not immediately apparent. Then, the normal workflow was to assemble statistical parameters by fitting models to experimental statistics. The experimental variogram, for example, was an important stepping-stone to a variogram model; but it was only a means to an end. We did not think of the massive look-up table of summary statistics for thousands of grouped pairs of data as something that could serve directly as an input parameter. But why not? Why in an age of computer power did we continue to create simplified mathematical models of statistical characteristics? Was it really necessary to boil the parameterization down to a few numbers, a nugget effect and a range, rather than leave the statistical summary in its original form as a massive look-up table? For me, this was the "aha" moment catalyzed by my belief, years earlier, that Chris Farmer's claim about indicator variograms was not correct. The reason I was wrong was that massive look-up tables of indicator variograms are a rich source of very detailed information. The mistake we were making was that we moved past this wealth of information and replaced it with a simple model.

The idea for the first prototype of an MPS simulation algorithm came from the accidental meeting of thoughts about the role of training images in reservoir simulation and the experience of having coded the Am-I-Right procedure for optical character recognition for a mining project. The principal difference between Am-I-Right and the MPS prototype is that, after scanning the image to build a probability distribution, the Am-I-Right procedure always took the most likely value while MPS used the distribution as a basis for random sampling.
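
The scan-then-sample idea can be sketched as follows. This is a simplified illustration rather than the prototype published by Guardiano and Srivastava: the square search template, the random visiting path, the rule of dropping the farthest conditioning value when no replicate is found, and the names `count_replicates` and `mps_prototype` are all assumptions, and no conditioning data are used.

```python
import numpy as np

def count_replicates(ti, event, n_cat):
    """Scan the training image for replicates of a data event and count the
    category found at the central node of each replicate."""
    counts = np.zeros(n_cat, dtype=int)
    ny, nx = ti.shape
    for y in range(ny):
        for x in range(nx):
            if all(0 <= y + dy < ny and 0 <= x + dx < nx
                   and ti[y + dy, x + dx] == v for dy, dx, v in event):
                counts[ti[y, x]] += 1
    return counts

def mps_prototype(ti, shape, radius=1, seed=0):
    rng = np.random.default_rng(seed)
    n_cat = int(ti.max()) + 1
    sim = np.full(shape, -1)                   # -1 marks nodes not yet simulated
    offsets = sorted(((dy, dx) for dy in range(-radius, radius + 1)
                      for dx in range(-radius, radius + 1) if (dy, dx) != (0, 0)),
                     key=lambda o: o[0] ** 2 + o[1] ** 2)   # nearest first
    for node in rng.permutation(sim.size):     # random visiting path
        y, x = divmod(int(node), shape[1])
        # Data event: already simulated neighbours inside the template
        event = [(dy, dx, sim[y + dy, x + dx]) for dy, dx in offsets
                 if 0 <= y + dy < shape[0] and 0 <= x + dx < shape[1]
                 and sim[y + dy, x + dx] >= 0]
        counts = count_replicates(ti, event, n_cat)
        while counts.sum() == 0:
            event.pop()                        # no replicate: drop the farthest value
            counts = count_replicates(ti, event, n_cat)
        # Am-I-Right would take np.argmax(counts); MPS samples the distribution
        sim[y, x] = rng.choice(n_cat, p=counts / counts.sum())
    return sim

# Hypothetical two-category training image with diagonal stripes
ti = (np.add.outer(np.arange(10), np.arange(10)) // 2) % 2
sim = mps_prototype(ti, (8, 8))
```

Rescanning the whole training image at every node is what made the early prototype so slow.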

The first tests of the MPS prototype were done on a digital image of a cross-bedded sandstone, like the one shown in Fig. 32.3. This was chosen because it presents curved structures that are difficult to capture with most geostatistical simulations, which tend to show straight features in the direction of maximum continuity unless an explicit attempt is made to use locally varying directions of anisotropy. Figure 32.6 shows the first published results of an MPS simulation (Guardiano and Srivastava 1992). That Tróia '92 paper used a two-level black-and-white training image because the first tests on an eight-level grayscale image were very slow; it would be several years before Sebastien Strebelle's Ph.D. research (Strebelle 2000) produced the first efficient and practical implementation of the original clumsy prototype.

Even though the first results were not brilliant, certainly not by today's standards, they did show that it was possible to impart to a simulation higher-order connectivities and patterns that are not explicitly summarized in variograms. In the right frame of Fig. 32.6, it is the black pixels that make the thin curved arcs, while the contiguous regions of white pixels tend to be larger and blockier. The middle frame of Fig. 32.6 shows that these features are hard to capture in an indicator simulation, which tends to symmetrize the black and white geometries when the proportion is near 50%.

**Fig. 32.6** The first published example of results of an MPS simulation (from Guardiano and Srivastava 1992). The frame on the left shows the training image, a black-white image obtained from a digital photograph of a slab of cross-bedded sandstone. The middle frame shows a realization from sequential indicator simulation. The right frame shows a realization from the MPS prototype algorithm

### **32.5 Concluding Thoughts**

Where do ideas come from? Is it possible to create fertile conditions for innovation? Of the many who have studied these questions, my favorite is Steve Johnson, who wrote *Where Good Ideas Come From: The Natural History of Innovation*; he has presented his thoughts in a 2010 TED Talk and also in a short YouTube video (https://www.youtube.com/watch?v=NugRZGDbPFU). Much of what Johnson identifies as key elements of innovation are in evidence in the origins of the MPS simulation algorithm: the slow incubation of hunches, the borrowing and combining of ideas from other people with related hunches, the catalytic effect of recognizing error, and of finding the missing piece.

The one piece of Johnson's message that resonates most strongly with my experience is the importance of staying connected to others; he often concludes his presentations with the observation that innovation comes by chance, but chance favors the connected mind. By "connected mind" he means a mind that is connected to what others are doing, how they are thinking about similar problems. It is the hunches and cast-off ideas of those people that you'll end up borrowing and adapting to improve a hunch of your own that has still not reached fruition.

Of the many different ideas that ended up being woven together into the MPS prototype, there may be a dropped thread, something that might be research worth pursuing. It is the fact that complete indicator variograms and cross-variograms provide extremely rich and detailed information about an image, so rich and detailed that they can, in fact, be used to replicate the original image. While replication of a training image should never be a goal, it's intriguing to think about what we might be able to do if we matched a small sub-set of the complete look-up table of all indicator variograms. We know that we get a "perfect" realization if we use 100% of the look-up table. Would the realization look "fairly good" or "completely ugly" if we decimated the complete look-up table and used only 10% of it, or only 1%? My own tests with the annealing version of this procedure, and the example in the Appendix, indicate that the indicator cross-variograms are sometimes not necessary, i.e. that you can achieve nearly perfect reproduction of the original image without them. Dropping all the indicator cross-variograms would considerably reduce the size of the look-up table, or any subset of it. Something worth trying?

My final reflection is on the beneficial tug-of-war between theory and practice. Throughout my career as a consultant, and tourist in academia, I have enjoyed discovering that the path to a solution sometimes starts when you enter the maze from the theory side, and sometimes starts from an entrance on the practical side. When theory leads you to the point of a set of equations, that need not be the end because there may be something useful to be learned in attempting to implement those equations in practice, in writing a piece of computer code that produces an answer in a reasonable amount of time. And, coming from the other end, having developed an algorithm that produces an intriguing result that seems "good" or "right", it's useful to try to work out why it works. Even if the answer came heuristically, the theory that explains why it's an approximately correct answer might reveal a generalization that makes it possible to improve the answer.

### **Appendix: Example of Reconstructing a Grid from Its Indicator Variograms and Cross-Variograms**

Figure 32.7 shows a tiny image with three levels of gray on a 3 × 3 grid. If we give values of 1, 2 and 3 to white, gray and black, the three levels give rise to two indicators: I1 with a threshold between 1 and 2 and I2 with a threshold between 2 and 3. There are two direct indicator variograms, γ<sub>1</sub> and γ<sub>2</sub>, and one cross-variogram, γ<sub>12</sub>. The nine locations give rise to 36 paired locations (not including the pairs that have zero separation). These 36 pairs are shown in Fig. 32.8, grouped into the 12 possible lags.

**Fig. 32.7** Example used to show how complete experimental indicator variograms can be used to reconstruct an image

**Fig. 32.8** All 36 pairs in the image in Fig. 32.7, grouped into the 12 lags

For any lag, the experimental indicator variogram is calculated by taking half the average squared difference between the paired indicators:

$$\gamma(h) = \frac{1}{2N(h)}\sum \left[ I(\mathbf{x}) - I(\mathbf{x} + h) \right]^2$$

Because the squared difference between 0s and 1s is always 0 or 1, all of the terms in the summation are either 0 or 1; the summation is simply a counting of the number of times that the indicators separated by *h* are different.


**Table 32.1** Look-up table for the experimental indicator variograms and cross-variogram for every lag for the image in Fig. 32.7

**Fig. 32.9** The sequence of steps used to interrogate Table 32.1 to deduce values in specific cells, the knowledge of which can then be used to fix the values of other cells by using other information from the look-up table. The sequence begins in the upper left where the look-up table is used for its information on the lag that spans the main diagonal. It then proceeds across the first row, down to the start of the second row, and across to the final solution at the lower left

For the image in Fig. 32.7, Table 32.1 gives the complete look-up table of the indicator variograms and cross-variograms in every lag, and includes for each lag the value of the summation term before the division by 2*N*(*h*), i.e. the number of pairs in each lag that have different indicators; these are in the columns headed #Diff1, #Diff2 and #Diff12.
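
A look-up table of this kind can be computed with a few lines of code. The sketch below makes several assumptions: the 3 × 3 grid is the one implied by the deductions later in this Appendix rather than a copy of Fig. 32.7, lags are written (dx, dy) with dy positive upwards, #Diff12 is taken to be the number of pairs in which both indicators differ (the usual cross-variogram summation), and the function name `lookup_table` is made up.

```python
import numpy as np

# Rows listed top to bottom; 1 = white, 2 = gray, 3 = black (assumed image).
IMAGE = np.array([[3, 2, 3],
                  [1, 1, 2],
                  [2, 3, 1]])

def lookup_table(image):
    z = np.flipud(image)                  # z[y, x] with y increasing upwards
    i1, i2 = z > 1, z > 2                 # the two indicators
    ny, nx = z.shape
    table = {}
    for dx in range(nx):
        for dy in range(-(ny - 1), ny):
            if dx == 0 and dy <= 0:
                continue                  # skip zero lag and mirror-image lags
            y0, y1 = max(0, -dy), min(ny, ny - dy)
            tail = (slice(y0, y1), slice(0, nx - dx))
            head = (slice(y0 + dy, y1 + dy), slice(dx, nx))
            d1 = i1[tail] != i1[head]
            d2 = i2[tail] != i2[head]
            n = d1.size                   # N(h): number of pairs at this lag
            diff = (int(d1.sum()), int(d2.sum()), int((d1 & d2).sum()))
            table[(dx, dy)] = {"N": n, "diff": diff,
                               "gamma": tuple(c / (2 * n) for c in diff)}
    return table

table = lookup_table(IMAGE)
for lag in sorted(table, key=lambda h: (h[0] ** 2 + h[1] ** 2, h)):
    print(lag, table[lag])
```

Each entry holds N(h), the three counts (#Diff1, #Diff2, #Diff12) and the corresponding variogram values obtained by dividing the counts by 2*N*(*h*).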

Figure 32.9 shows a sequence of steps that can be used to interrogate Table 32.1 for the information that allows the values of specific cells to be deduced. It begins in the upper left with the (2, 2) lag that spans the main diagonal. There is only one pair that contributes to this lag and the (2, 2) row (second from the bottom of Table 32.1) tells us that:

1. the two cells of this pair have the same I1 indicator (a 0 in the #Diff1 column);
2. the two cells of this pair have different I2 indicators (a 1 in the #Diff2 column).

The second of these facts says that the two values are either 2 and 3, or 1 and 3; but the second choice is contradicted by the first fact, so the only choice is a 2 in one cell and a 3 in the other. This gives us the next frame in Fig. 32.9, where a 2 has been fixed in the lower left and a 3 in the upper right. Note that this is exactly where the 180° rotation may occur because we can't tell which one is the 2 and which is the 3. But once we make a choice, everything else is fixed; so the worst that will happen is that the final solution will be rotated upside-down.

Proceeding across the first row of Fig. 32.9, the next thing we check is the (1, 2) lag, to which two pairs contribute. The look-up table entries for the (1, 2) lag, fifth row from the bottom, tell us that both pairs have the same I1 and I2 indicators, because of the 0s in the #Diff1 and #Diff2 columns. The only way that this can occur is if the value paired with the 2 in the lower left is also a 2, and the value paired with the 3 in the upper right is also a 3.

Continuing across the first row of Fig. 32.9, the next thing we check is the (0, 2) lag, to which three pairs contribute. The look-up table entries for the (0, 2) lag, fifth row from the top, tell us that all three pairs have different I2 indicators, because of the 3 in the #Diff2 column. Since the lower left is a 2, this tells us that the upper left corner must be a 3; and since the upper right is a 3, the lower right is either a 1 or a 2.

The sequence continues on the second row, where we check the (2, −2) lag. There is only one pair that contributes to this lag, and this pair has different values for the I1 indicator, because of the 1 in the #Diff1 column. Since the upper left is a 3, the only way this can happen is if the value in the lower right is a 1.

Continuing across the second row, the next thing we check is (2, −1) lag, to which two pairs contribute. The entries for the (2, −1) lag, third row from the bottom, tell us that both pairs have the same I1 indicators, so the 1 in the bottom right must be paired with a 1, and the 3 in the upper left must be paired with either a 2 or a 3. For the same two pairs, one of the I2 indicators is the same and one is different; we know that the pair with the same I2 indicators is the pair of 1–1 values that we just fixed, so it's the other pair that must have different I2 indicators. We already know that the 3 must be paired with a 2 or a 3, so the only correct choice is a 2.

Moving along the second row, the last thing we check is the (0, 1) lag, to which six pairs contribute. In the #Diff1 column, the top row in Table 32.1 tells us that five of the six pairs have different I1 indicators. With the eight values already fixed in previous steps, we can see three of those (0, 1) pairs: the 3–1 and 1–2 pairs in the first column and the 2–1 pair in the last column. But the only way we can get to five such pairs is if the middle column gives us two more. So the only correct choice for the middle cell is a 1 … which gives us the last value, and completely reconstructs the original image (Fig. 32.7) with no conditioning data, but with heavy use of the information in the complete table of indicator variograms.

Regardless of the size of the image, or of the number of levels in the grayscale (or number of colors in a color image), the approach of starting at the corners and working inwards will always work. There is enough information in the complete look-up table of experimental indicator variograms and cross-variograms that the corner pixels can be pinned down and then used to leverage the solution for the neighbors. In this particular example, the indicator cross-variogram was never needed for the final solution. It may be that the indicator cross-variograms are never needed, and that the image can always be exactly reconstructed (up to a 180° rotation) using only the indicator variograms.
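
For a grid this small, the claim can be checked by brute force: enumerate all 3<sup>9</sup> = 19,683 possible grids and keep those whose direct indicator variogram counts match the target for every lag. The sketch below assumes the 3 × 3 grid implied by the steps above (it is not copied from Fig. 32.7) and uses made-up names; it says nothing about larger images.

```python
import itertools
import numpy as np

# Rows listed top to bottom; the grid implied by the worked example (assumed).
IMAGE = np.array([[3, 2, 3],
                  [1, 1, 2],
                  [2, 3, 1]])

def signature(image):
    """#Diff1 and #Diff2 for every lag; cross-variogram counts are left out."""
    ny, nx = image.shape
    sig = []
    for ind in (image > 1, image > 2):
        for dx in range(nx):
            for dy in range(-(ny - 1), ny):
                if dx == 0 and dy <= 0:
                    continue              # skip zero lag and mirror-image lags
                y0, y1 = max(0, -dy), min(ny, ny - dy)
                tail = ind[y0:y1, :nx - dx]
                head = ind[y0 + dy:y1 + dy, dx:]
                sig.append(int(np.count_nonzero(tail != head)))
    return tuple(sig)

target = signature(IMAGE)
solutions = []
for cells in itertools.product((1, 2, 3), repeat=9):
    candidate = np.array(cells).reshape(3, 3)
    if signature(candidate) == target:
        solutions.append(candidate)
```

For this grid the search keeps exactly two candidates, the image and its 180° rotation, which is what the step-by-step deduction implies.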

### **References**


Deutsch CV, Tran TT (2002) FLUVSIM: a program for object-based stochastic modeling of fluvial depositional systems. Comput Geosci 28(4):525–535



### **Chapter 33 Predictive Geometallurgy: An Interdisciplinary Key Challenge for Mathematical Geosciences**

**K. G. van den Boogaart and R. Tolosana-Delgado**

**Abstract** Predictive geometallurgy tries to optimize the mineral value chain based on a precise and quantitative understanding of: the geology and mineralogy of the ores, the minerals processing, and the economics of mineral commodities. This chapter describes the state of the art and the mathematical building blocks of a possible solution to this problem. This solution heavily relies on all classical fields of mathematical geosciences and geoinformatics, but requires new mathematical and computational developments. Geometallurgy can thus become a new defining challenge for mathematical geosciences, in the same fashion as geostatistics has been in the first 50 years of the IAMG.

**Keywords** Geostatistics ⋅ Statistical scales ⋅ Microstructure ⋅ Computational geometry ⋅ Processing optimisation ⋅ Value of information ⋅ Mineral liberation analyser ⋅ QUEMSCAN

### **33.1 Introduction**

*Geometallurgy*, from the Greek words for earth (geia), metal (metallo) and work (ergon), can be understood as the exploitation of a metallic ore based on a precise understanding of its geoscientific characteristics. Geometallurgy is hence a cooperation field for geoscientists and mineral processing engineers, something which has occurred in virtually all mining operations. A modern understanding of geometallurgy, what we could call *predictive geometallurgy*, proposes a quantitative approach to the subject. In rough terms, that requires optimizing the ore processing based on automated mineralogy and microstructure characterisation of the ore, coupled with geometallurgical tests. These are tests conducted at several scales (from lab to plant) along which the actual ore is processed in realistic conditions in order to study the differential behaviour of the several ore and waste mineral phases, and thus the enriching potential of the ore through the processes considered.

K. G. van den Boogaart ⋅ R. Tolosana-Delgado (✉), Helmholtz Institute Freiberg for Resource Technology, Freiberg, Germany; e-mail: r.tolosana@hzdr.de

© The Author(s) 2018. B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_33

As a subject, mathematical geosciences has always had a wide application in mining. Nowadays typical topics of the area are geostatistics, the analysis of data from special scales (such as compositional data or spherical data), numerical analysis of flow models, remote sensing, (mineral) potential modelling (for instance with weights of evidence), fractals, geodata standards, 3D geomodelling, or data integration techniques. The aim of this chapter is to show the deep link between geometallurgical problems and techniques from the main fields of mathematical geosciences.

Geometallurgy distinguishes primary and secondary properties of the ore (Coward et al. 2009). Primary properties are intrinsic to the ore and do not depend on the process. Secondary or response properties describe the behaviour of the ore during processing. The primary properties are observed by chemical assays, automated mineralogy (for example with QUEMSCAN or the Mineral Liberation Analyser, MLA), X-ray methods, and other analytical instrumentation. Secondary properties are measured with geometallurgical tests, such as blasting tests, Bond mill tests, flotation tests, magnetic separation, density separation and so on. These can even be conducted using the operation itself, that is, on the real plant. The secondary properties are used to predict the outcome and costs of the processing.

To the authors' knowledge, all studies conducted on predictive geometallurgy by mathematical geoscientists (Bye 2011; Boisvert et al. 2013; Rossi and Deutsch 2014; Hosseini and Asghari 2015; Tolosana-Delgado et al. 2015; Ortiz et al. 2015; Deutsch et al. 2016) consisted of appropriately predicting the secondary properties at each block of a mining block model, and proposing that mining and processing engineers conduct their mine planning and plant scheduling based on those properties instead of on metal grades. The first step (Vann et al. 2011) is the geometallurgical analysis of the ore body with respect to its primary properties. Samples of similar primary properties or geology are often said to belong to the same *geometallurgical domain*. Conventional exploratory techniques such as k-means clustering or PCA (Caciagli Warman 2015), as well as machine learning methods, are nowadays used for this task. Moreover, primary properties are also interpolated to the block model, ideally with geostatistics.

The second step is geometallurgical testwork, i.e. the characterisation of secondary properties of material from different geometallurgical domains. Often the goal of these tests is to define a mapping from the primary properties to the secondary properties, e.g. via more or less complex regression models (Keeney et al. 2011; Everett and Howard 2011; Sepulveda et al. 2017). Having such a mapping makes it possible to populate the block model with estimated secondary geometallurgical properties and to infer the expected income and costs of each block. Such interpolation of secondary variables is often done on additive proxies (Ortiz et al. 2015; Deutsch et al. 2016). The result is typically called a *geometallurgical (block) model*.

This can be used in at least three different ways by an operation, to inform both short- and long-term actions (McKay et al. 2016). First, the prediction of costs and recovery allows a monetary value to be assigned to each block. These values can be used instead of grade as a better proxy of cashflow in further calculations, such as ultimate pit optimisation or mine scheduling. Value is generated by minimizing capital costs, due to early exploitation of highly valuable parts of the deposit, and by an improved distinction between ore and waste (Bye 2011). Second, the predicted properties can also be used to find matching ore partners for blending, to reduce feed variability in the plant and ensure constant plant operation conditions. Value is generated by a lower risk of plant failure, optimal use of the capacity of all parts of the plant, and lower control effort thanks to the ability to find the optimal operation conditions empirically (Shaw et al. 2013). The third option is to use that knowledge to actively adapt the processing conditions to each portion of the varying feed. The value lies in higher recovery, lower operation costs, more extensive exploitation (Powell 2013; Tolosana-Delgado et al. 2015) and ultimately a lower ecological footprint.

### **33.2 Process Modelling**

With the exhaustion of simple-texture, single-commodity, easy-to-reach deposits, the mining industry has been confronted with the need to study a broad range of ore properties, beyond the classical grade. As mentioned in the introduction, predictive geometallurgy proposes to obtain a wealth of primary and secondary properties at each mining block in order to reproduce its behaviour through the processing chain and, ultimately, to predict its monetary value. This section focuses on such process modelling.

A couple of steps along the value chain after extraction and crushing, ores are treated with a variety of processes, mostly physical and physico-chemical, in order to liberate the several mineral grains and separate them into different streams. Later on, streams enriched in ore minerals are sent through metallurgical processes, mostly chemical reactions and physical changes of state, devised to break the crystal structure of the ore minerals and produce the final value metals. All these steps can be studied with two approaches. In the first one, each operation unit is considered as a black box, and data on both the operating conditions and the properties of input and output streams are obtained in order to build empirical rules to predict the output streams (Matos Camacho et al. 2015). In the second strategy, these prediction laws are built in accordance with thermodynamic, chemical and physical first principles. These strategies are not mutually exclusive, as one can derive the form of a parametric predicting equation from first principles and fit the parameters with the empirical approach.

The first kind of processes mentioned, those mostly keeping the crystal structure of the minerals involved, include many different processes. Grinding and milling aim at splitting particles in order to produce single-mineral, or *liberated*, particles. Sizing, magnetic separation, density separation and many other separation processes aim at splitting a *feed* stream into two or more streams with particles primarily classified according to one particular bulk volumetric property, like size, magnetic susceptibility or density. Finally, froth flotation aims at separating particles according to the hydrophobicity of their surface minerals as they fall through a bubble-rich two- or three-fluid medium (including water, gas, nonpolar liquids, oils). This is one of the most complex, yet least understood, processes in minerals processing, involving effects from fluid dynamics, surface physics, and organic and inorganic chemistry. In processing plants, several of these processes might be combined so that the output streams of each processing unit are fed into other units, thus building serial or parallel chains, trees and even complex networks with feedback loops.

Particle-based models (Lamberg 2011) are a particularly simple and promising modelling strategy, primarily of use for such networks of mineral processing units. Here, each particle of the general feed is given a probability of going to each one of the output streams of each processing unit, according to its individual properties and certain characteristics of the bulk material within the unit. As long as these probabilities can be considered constant in time, the transient behaviour of the system can be modelled with a simple system of first-order differential equations with constant coefficients (Tolosana-Delgado et al. 2015). Other more complex settings, in particular milling steps within loops, pose a much greater challenge and, to the authors' knowledge, remain unexplored.
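
As a minimal illustration of such a linear system, consider one particle class moving through a hypothetical two-unit circuit with a recycle stream. The unit layout, the routing probabilities, the leaving rates and all variable names below are invented for this sketch and are not taken from the cited works.

```python
import numpy as np

# Hypothetical two-unit circuit (rougher -> cleaner with recycle), one particle class.
rate = np.array([0.5, 0.25])    # 1/min: how fast particles leave each unit
route = np.array([[0.0, 0.6],   # from unit 0: 60 % to unit 1, remaining 40 % to tailings
                  [0.3, 0.0]])  # from unit 1: 30 % recycled to unit 0, 70 % to concentrate
feed = np.array([1.0, 0.0])     # t/min entering each unit from outside

# dm/dt = A m + feed, with m the mass of this particle class held in each unit
A = (route.T - np.eye(2)) @ np.diag(rate)

def transient(m0, dt, n_steps):
    """Explicit Euler integration of the linear system (adequate for small dt)."""
    m = np.array(m0, dtype=float)
    for _ in range(n_steps):
        m = m + dt * (A @ m + feed)
    return m

steady = np.linalg.solve(-A, feed)                    # holdup once dm/dt = 0
leaving = (1.0 - route.sum(axis=1)) * rate * steady   # flows to tailings and concentrate
late = transient([0.0, 0.0], dt=0.1, n_steps=2000)    # start-up from an empty circuit
```

With constant coefficients the steady state follows from one linear solve, and the mass leaving the circuit balances the feed; one such system would be set up for each particle class.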

The second kind of processes typically destroys the ore mineral structure and brings the material into a fluid state: a water solution (hydrometallurgy, electrometallurgy) or a melt (pyrometallurgy). All these processes can be modelled with relatively well-known thermoelectro-chemical reactions. Lack of space and a certain distance from the classical fields of mathematical geosciences made us leave the subject out of this contribution.

Whichever modelling strategy is followed, it is necessary to characterise the frequency distribution of certain properties of the particle streams. The most obvious are the size and mineralogical composition of the particles, in exposed surface, mass and even in volume proportions. Derived from these, elemental deportment and liberation distribution are also relevant. Elemental deportment is the proportion of the mass of a given element contributed by each mineral. The liberation distribution gives the volume (or mass) of particles containing a certain mineral in a (volume, mass or surface) proportion equal to or larger than a threshold, as a function of that threshold. This is a cumulative distribution in the fashion of the better known recovery and tonnage curves in classical Geostatistics. Finally, more complex mineral association or paragenesis indicators also matter, as concentration processes often do not target the value minerals themselves, but some accompanying, more abundant minerals. The next section discusses which instruments are used to measure these properties and which challenges they pose to mathematical geoscientists.
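Both definitions can be made concrete with a small Python sketch. The particle population below is invented for illustration, the copper mass fractions are approximate values for ideal mineral formulas, and the function names `deportment` and `liberation_curve` are hypothetical helpers, not part of any particular software. Mass proportions are used throughout; volume or surface proportions would follow the same pattern.

```python
# Hypothetical particle population: mineral masses (grams) per particle.
particles = [
    {"chalcopyrite": 9.0, "quartz": 1.0},
    {"chalcopyrite": 2.0, "quartz": 8.0},
    {"bornite": 1.0, "quartz": 9.0},
    {"chalcopyrite": 10.0},
]
# Approximate Cu mass fractions, assuming ideal stoichiometry.
cu_fraction = {"chalcopyrite": 0.346, "bornite": 0.633, "quartz": 0.0}


def deportment(particles, element_fraction):
    """Share of the element's total mass contributed by each mineral."""
    contributed = {}
    for particle in particles:
        for mineral, mass in particle.items():
            contributed[mineral] = (
                contributed.get(mineral, 0.0) + mass * element_fraction[mineral]
            )
    total = sum(contributed.values())
    return {mineral: value / total for mineral, value in contributed.items()}


def liberation_curve(particles, mineral, thresholds):
    """Mass fraction of particles whose content of `mineral` is >= each threshold."""
    total_mass = sum(sum(p.values()) for p in particles)
    curve = []
    for threshold in thresholds:
        mass = sum(
            sum(p.values())
            for p in particles
            if p.get(mineral, 0.0) / sum(p.values()) >= threshold
        )
        curve.append(mass / total_mass)
    return curve


cu_deportment = deportment(particles, cu_fraction)
cpy_liberation = liberation_curve(particles, "chalcopyrite", [0.0, 0.5, 1.0])
print(cu_deportment)
print(cpy_liberation)
```

The deportment is itself a composition (its parts are positive and sum to one), and the liberation curve is non-increasing in the threshold, which is why the text compares it with tonnage curves.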

### **33.3 Ore Characterisation**

In the past, one-commodity grade was considered the sole and sufficient variable to characterize a mining block or a deposit. This variable could be more or less safely considered as a positive variable yet with an interval scale, according to the definition by Stevens (1946). This explains why Geostatistics was originally concerned with univariate properties following the properties of Gaussian or lognormal random fields (Journel and Huijbregts 1978).

However, the present and future evaluation of a mining operation will require many more variables, kinds of scales and new geostatistical models. Multicommodity grades, geochemistry and mineralogy, being vectors of positive or relative components (Pawlowsky-Glahn 2003; Boogaart and Tolosana-Delgado 2013), have already brought the need to consider multivariate ratio scales and compositional scales (Caciagli Warman 2015). The routine analysis of mineral and chemical properties by techniques like X-ray Fluorescence (XRF) or Instrumental Neutron Activation Analysis (INAA) for bulk geochemistry, X-ray Diffraction (XRD) for bulk mineralogy, or Electron Probe Microanalysis (EPMA), Proton-Induced X-ray Emission (PIXE), Laser Ablation Inductively Coupled Plasma Mass Spectrometry (LA-ICP-MS) or Raman spectroscopy for single grain or locally resolved chemistry and mineralogy will ensure a continuous growth of compositional and multivariate positive data in predictive geometallurgy. The generalisation of microstructural analysis, with instruments like QEMSCAN, MLA or X-ray tomography (Bam et al. 2016; Becker et al. 2016), will make further primary properties easy to obtain: particle size curves (showing a distributional scale (Delicado 2008; Menafoglio et al. 2016a)), interphase mean contact length composition (a sort of two-way composition (Caracciolo et al. 2012)), grain size curves of each mineral phase (a discrete set of parallel distributions), deportment (a composition giving the proportion of mass of a certain element contributed by each of its bearing minerals), and many more properties. Even the application of EBSD (electron backscatter diffraction) will make it possible to characterise the distribution of crystal orientations (spherical distributions) or its modal values (spherical directions).
Spectral information is also produced by many instruments, and although spectra are nowadays preferably interpreted in terms of chemical elements, minerals or paragenesis (Chlingaryan et al. 2015) before treatment, one might think of future applications in which core scanning or airborne spectral data are considered as informative on their own in a 3D geomodel. Spectral information is easy and fast to obtain during the operation and thus could help to guide the extraction process and identify ore types during mining and further processing (Nguyen 2013).

Many of these characterisation techniques can be ordered in a chain of methods, where the more advanced methods provide more and more detail but at the price of lower precision, higher costs, and longer acquisition or turnaround times. For instance, XRD, though primarily measuring modal mineralogy, can be used to infer bulk geochemical composition, though with higher uncertainty than directly using XRF. Also, MLA, though primarily measuring grain and particle structures, can provide a modal mineralogy, but at higher costs than XRD. Finally, EBSD allows crystallites and defects to be characterised, but can also be used to infer the mineralogical microfabric, albeit with longer measurement times than MLA for a fixed precision.

The other way around, inferring more advanced characteristics of the ore indirectly from cheaper measurements, is in general an inverse problem. Inverse problems are much more difficult to handle and often do not have a unique solution. For instance, inferring modal mineralogy from XRF is an *endmember problem*, and delivers at most equivalence classes of solutions (Tolosana-Delgado et al. 2011; Berry et al. 2011). Interpreting spectra as chemical and mineral compositions often also requires unmixing the signal, modelled as a linear mixture of known endmember spectra. Finally, inferring processing properties from primary properties might require statistical models or machine learning methods to approximate the inverse problem solution (e.g. Matos Camacho et al. (2015) for magnetic susceptibility from MLA data). In summary, each analytical method has a specific role to play, and several methods will be required to appropriately characterise all relevant aspects of the ores.
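The non-uniqueness of the endmember problem can be shown with a toy Python example. All numbers are hypothetical: three minerals A, B and C, two measured elements, and invented element contents chosen so that the arithmetic is easy to follow. With fewer independent equations than unknown mineral proportions, any modal mineralogy can be shifted along a null direction of the forward model without changing the bulk chemistry.

```python
# Rows: measured elements; columns: minerals A, B, C (hypothetical contents).
ELEMENT_CONTENT = [
    [0.2, 0.4, 0.6],  # mass fraction of element 1 in each mineral
    [0.3, 0.3, 0.3],  # mass fraction of element 2 in each mineral
]


def bulk_chemistry(modal_mineralogy):
    """Forward model: bulk element grades implied by the mineral proportions."""
    return [
        sum(content * share for content, share in zip(row, modal_mineralogy))
        for row in ELEMENT_CONTENT
    ]


def shift_along_null_direction(modal_mineralogy, step):
    """Move along (1, -2, 1), which changes neither the bulk grades nor the sum."""
    direction = (1.0, -2.0, 1.0)
    return [share + step * d for share, d in zip(modal_mineralogy, direction)]


first_solution = [0.3, 0.4, 0.3]
second_solution = shift_along_null_direction(first_solution, 0.1)

print(bulk_chemistry(first_solution), bulk_chemistry(second_solution))
```

Both mineralogies are valid compositions and reproduce the same bulk grades, so the chemistry alone only identifies the whole family of solutions; extra information (for instance XRD or MLA data) is needed to choose within it.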

Another classical class of metrological problems appearing in ore characterisation is *instrumental calibration*, namely the inference of the composition of bulk samples or spots, together with the corresponding uncertainty, by comparing their signals with the signal obtained from a reference material or *standard* where the property is known. The specific challenge for geometallurgy is the high variability of natural materials, which is difficult to reflect in standards with comparable compositional and physical characteristics (called *matrix matched*) and difficult to measure with a single method. This concerns many of the techniques mentioned before, like XRF, INAA, ICP-MS, PIXE and EPMA.
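In its simplest form, calibration against standards can be sketched as a straight-line fit followed by inverse prediction. The Python example below is only a baseline under strong assumptions: a linear instrument response, no matrix effects, a single analyte, and invented standard concentrations and signals. It deliberately ignores the compositional and matrix-matching issues raised above.

```python
def fit_calibration_line(concentrations, signals):
    """Ordinary least squares fit of signal = intercept + slope * concentration."""
    n = len(concentrations)
    mean_c = sum(concentrations) / n
    mean_s = sum(signals) / n
    sxx = sum((c - mean_c) ** 2 for c in concentrations)
    sxy = sum((c - mean_c) * (s - mean_s) for c, s in zip(concentrations, signals))
    slope = sxy / sxx
    intercept = mean_s - slope * mean_c
    return intercept, slope


def predict_concentration(signal, intercept, slope):
    """Invert the calibration line to estimate an unknown concentration."""
    return (signal - intercept) / slope


# Hypothetical standards: known concentrations (wt%) and measured counts.
standard_concentrations = [0.5, 1.0, 2.0, 4.0]
standard_signals = [130.0, 245.0, 510.0, 1005.0]

intercept, slope = fit_calibration_line(standard_concentrations, standard_signals)
print(predict_concentration(600.0, intercept, slope))
```

A realistic treatment would also propagate the uncertainty of the fitted line and of the standards themselves into the predicted concentration.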

From the point of view of mathematical geosciences, these issues translate into problems of calibration, data fusion and consensus building. Data have often been collected during different periods with different instruments at different labs. Seldom were all methods applied to all locations. Different batches need to be made compatible and calibrated against each other. In the authors' opinion, solutions for such problems will require existing concepts and tools and new developments from geodata management, geo-ontology and geoinformatics.

Additionally, local analytics techniques (MLA, QEMSCAN, X-ray tomography, PIXE, EPMA, Raman) bring their own problems to be solved with mathematical geosciences techniques. It is often very challenging or impossible to acquire standard material that is homogeneous at micron scale and matrix-matched to the ore samples. Geostatistical models have been proposed for supporting such local calibration efforts (Tolosana-Delgado et al. 2013).

Imaging techniques are also becoming more and more popular, at all spatial scales. More and more methods (hyperspectral satellite- and air-borne, drone-borne imaging, mine face imaging, core scanning, EBSD, MLA, X-Ray-CT, PIXE, …) acquire images rather than only univariate or compositional information. On large scales, from the drill core to deposit scale, imaging is gaining importance for the characterisation of the meso- to megastructure of the deposit, because the selective separation of ore zones from barren zones during exploration, mining, extraction and waste pre-screening is highly dependent on such structures. At submillimeter scales, processing methods and processing costs react very sensitively to analogous microstructural properties: for instance, the type of intergrowth of minerals strongly conditions the milling necessary to achieve sufficient liberation (Perez-Barnuevo et al. 2013), and milling is one of the most cost intensive processing steps. Many of these methods measure spectral information at each pixel. Various supervised and unsupervised machine learning techniques have been used for mapping spectral information to geometallurgically relevant quantities (Decamp et al. 2015; Harraden et al. 2016; Nguyen et al. 2016). Image processing methods that analyse structure will thus become more and more relevant in geometallurgy.

Moreover, surface imaging techniques like MLA or QEMSCAN suffer from stereological degradation: these instruments are devised to characterise geometric properties of 3D bodies, but only observe them on 2D sections. It is well-known that only some 3D properties can be estimated without bias by averaging over their 2D counterparts. This gives a certain confidence in properties like volumetric modal mineralogy (estimated from the proportions of pixels of the several minerals on the measured surface), mineral association as the proportion of surface of a mineral in contact with all other minerals (estimated from the proportion of contact lengths on the measured surface) or specific surfaces. But other highly relevant properties, like liberation distribution, grade curves, tonnage curves or particle and grain size distributions, suffer significant stereological degradation (Perez-Barnuevo et al. 2012).
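The size bias of 2D sections can be reproduced with a short Monte Carlo experiment in Python. The setting is an idealisation chosen for this illustration: identical spherical grains of radius `true_radius`, each cut by a plane whose distance from the grain centre is uniformly distributed over the radius. Under these assumptions the expected radius of the sectional circle is π/4 times the true radius, so the mean apparent size underestimates the true one by about 21%.

```python
import math
import random


def mean_section_radius(true_radius, n_sections, seed=0):
    """Average radius of the circles obtained by randomly sectioning a sphere."""
    rng = random.Random(seed)  # fixed seed so the experiment is reproducible
    total = 0.0
    for _ in range(n_sections):
        # Distance of the cutting plane from the sphere centre.
        distance = rng.uniform(0.0, true_radius)
        total += math.sqrt(true_radius ** 2 - distance ** 2)
    return total / n_sections


apparent = mean_section_radius(1.0, 100_000)
print(apparent, math.pi / 4.0)
```

Real particles are neither spherical nor equally sized, and larger grains are more likely to be hit by the section plane, so actual corrections are considerably more involved than this single factor.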

Open problems for the next generation of mathematical geoscientists will include, to mention a few, the development of widely accepted local analytics calibration procedures; the propagation of uncertainties through image analysis methods; or the integration of several analytical techniques through consensus-building, e.g. to deliver mutually consistent measurements of bulk mineral and chemical compositions as well as elemental deportment together with their uncertainties out of XRD, XRF, EPMA and MLA measurements of the same sample. Correcting stereological degradation is also an open issue.

### **33.4 Orebody Modelling**

The generation of large scale 3D models of the ore bodies is the classical key contribution of Mathematical Geosciences to the mining business. Nowadays, point and block kriging or simulation for grade variables and indicator-based techniques (indicator kriging, sequential indicator simulation, plurigaussian simulation) for categorical variables are accepted standard techniques. Beyond the framework of Gaussian random fields, cumulant-based (Dimitrakopoulos et al. 2010; Minniakhmetov and Dimitrakopoulos 2017) and copula-based (Musafer et al. 2013, 2017) proposals, as well as multiple point geostatistics (MPS), can be found in scientific papers, though their penetration and acceptance in the industry is still negligible. Multivariate issues are also seldom considered, though compositions (mineral or chemical) are geometallurgically relevant primary variables, and techniques do exist to predict or simulate them at both point (Pawlowsky 1989; Pawlowsky-Glahn and Burger 1992; Pawlowsky-Glahn and Olea 2004; Tolosana-Delgado 2006; Tolosana-Delgado et al. 2011; Mueller et al. 2014) and block support (Tolosana-Delgado et al. 2013) in a fashion consistent with their scale, namely delivering positive and constant-sum predictions/simulations abiding by a relative scale.
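Why the scale matters can be seen in a small Python sketch that contrasts two ways of combining neighbouring compositions with a given set of weights. The centred log-ratio (clr) transform used here is one common route in the compositional data literature and is offered as an illustration, not as the specific procedure of the works cited above. The sample compositions and the weights (which sum to one and include a negative value, as kriging weights may) are hypothetical.

```python
import math


def clr(composition):
    """Centred log-ratio transform of a strictly positive composition."""
    logs = [math.log(part) for part in composition]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def clr_inverse(coefficients):
    """Back-transform: exponentiate and close to unit sum."""
    exponentials = [math.exp(value) for value in coefficients]
    total = sum(exponentials)
    return [value / total for value in exponentials]


def naive_linear_prediction(samples, weights):
    """Weighted combination of the raw parts (can leave the simplex)."""
    n_parts = len(samples[0])
    return [sum(w * s[j] for w, s in zip(weights, samples)) for j in range(n_parts)]


def logratio_prediction(samples, weights):
    """Weighted combination in clr coordinates, then back-transformed."""
    transformed = [clr(s) for s in samples]
    n_parts = len(samples[0])
    combined = [
        sum(w * t[j] for w, t in zip(weights, transformed)) for j in range(n_parts)
    ]
    return clr_inverse(combined)


samples = [[0.90, 0.08, 0.02], [0.50, 0.30, 0.20], [0.60, 0.10, 0.30]]
weights = [1.3, -0.4, 0.1]  # hypothetical weights, summing to one

naive = naive_linear_prediction(samples, weights)
consistent = logratio_prediction(samples, weights)
print(naive)
print(consistent)
```

With these inputs the raw combination yields negative parts, whereas the log-ratio route always returns positive parts that sum to one.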

The geostatistical treatment of other geometallurgically relevant multivariate scales has received limited to no attention so far by the mathematical geosciences community. The challenges are multiple (Boogaart et al. 2013). Geometallurgical data from EBSD are known to exhibit spherical scales, for which a kriging approach is readily available (Boogaart and Schaeben 2002a, b). One-dimensional distributions are much more abundant, and methodological developments for kriging, cokriging and conditional simulation exist via functional analysis (Menafoglio et al. 2016a, b). Nevertheless, application to the many geometallurgical data with distributional scale still requires theoretical and practical developments. Upscaling of these geometallurgical properties presents counter-intuitive characteristics: for instance, a categorical variable at point support gives rise to a compositional variable at block support, and while block kriging is generally thought to reduce uncertainty, block "estimates" of distributional and of categorical variables may very well exhibit higher entropy themselves. With a few exceptions based on geostatistical simulation (Deutsch et al. 2015), downscaling has not yet been systematically considered, but it may become a necessary tool to populate block models with smaller scale granularity, for instance for incorporating information from blast-hole analysis into the 3D models. Finally, the joint consistent modelling of several variables from different scales (for instance modal mineralogy, geochemistry, hardness and lithology) has received limited attention (see Maleki and Emery 2015 for a two-point case study with one continuous and one categorical variable), and only seminal ideas about the combination of Bayesian spaces (Boogaart et al. 2014), multigrid Markov Mesh Models (Stien and Kolbjornsen 2011; Kolbjornsen et al. 2014), generalized linear models and MPS have been presented for discussion (Boogaart et al. 2014).
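The upscaling remark about categorical variables can be checked with a few lines of Python. The block below holds four hypothetical point-support lithology labels; aggregating them gives a composition of category proportions, whose Shannon entropy is positive even though each individual point, once observed, carries a single category with zero entropy.

```python
import math
from collections import Counter


def block_composition(point_categories):
    """Proportion of each category among the points inside a block."""
    counts = Counter(point_categories)
    n = len(point_categories)
    return {category: count / n for category, count in counts.items()}


def shannon_entropy_bits(proportions):
    """Shannon entropy (in bits) of a set of proportions summing to one."""
    return -sum(p * math.log2(p) for p in proportions if p > 0.0)


points_in_block = ["ore", "ore", "waste", "ore"]  # hypothetical point labels
composition = block_composition(points_in_block)
block_entropy = shannon_entropy_bits(composition.values())
point_entropy = shannon_entropy_bits([1.0])  # a single known category

print(composition, block_entropy, point_entropy)
```

The block-support variable is thus a composition rather than a category, and should be modelled on a compositional scale.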

It has been shown that the conditional distribution of the geostatistical simulation is highly relevant for optimal processing choices (Boogaart et al. 2013). Gaussian geostatistics only delivers that correctly in a Gaussian random field setting. As with strategic mine planning (Dimitrakopoulos 2011; Goodfellow and Dimitrakopoulos 2017), non-linear simulation methods that better reproduce the conditional distributions would thus be more appropriate for geometallurgical optimisation. However, so far (April 2017), beyond single categorical variables, no case studies could show the added value of MPS methods in the context of geometallurgy. The fundamental difficulty appears to be producing sufficiently large, stochastically representative training images (Emery and Lantuejoul 2014), a problem made even worse by the many relevant variables, some with multivariate, compositional or distributional scales.

Besides the geometric modelling of the large-scale structure of a deposit, 3D geomodelling also offers a tool for modelling and simulating the microstructure and texture of the ores. Stochastic simulation of such 3D geomodels of ores might be necessary to appropriately simulate breakage of microstructure by crushing, grinding and milling, as well as to offer an approach to stereological reconstruction. This is so because all these problems require an appropriate description of the geometric spatial relations between the mineral grains, and not just summaries of their composition. However, new concepts, models and techniques have to be developed to link the macroscale described by geostatistics and the microscale, possibly described by stochastic geometry.

Another challenge posed by such multi-scale (in the sense of spatial granularity), multi-scale (in the sense of statistical kinds of data), multi-step (data are added to the models at different times), multi-dimensional geometric modelling of ore bodies is the structuring, management and exploitation of the data needed to provide appropriate input for the methods used. A more intimate link between geostatistics and geodatabases will be required for that, as well as flexible and sequential conditioning methods able to incorporate data into the conditional distributions in batches, as they become available. Sequential data assimilation techniques have been successfully used for this task in the assessment of univariate quantities (Wambeke and Benndorf 2017).
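The idea of conditioning sequentially on batches can be sketched with the simplest possible case: a single Gaussian quantity (say, the mean grade of one block) updated by noisy measurements. This scalar Bayesian update is a deliberately simplified stand-in chosen for illustration and is not the method of the cited study; the prior, the measurement values and the noise variance are hypothetical.

```python
def gaussian_update(prior_mean, prior_var, observation, noise_var):
    """Posterior mean and variance after one noisy observation of the quantity."""
    gain = prior_var / (prior_var + noise_var)
    posterior_mean = prior_mean + gain * (observation - prior_mean)
    posterior_var = (1.0 - gain) * prior_var
    return posterior_mean, posterior_var


def assimilate(prior_mean, prior_var, batches, noise_var):
    """Condition on batches of observations one after the other."""
    mean, var = prior_mean, prior_var
    for batch in batches:
        for observation in batch:
            mean, var = gaussian_update(mean, var, observation, noise_var)
    return mean, var


# Hypothetical prior block grade and two different batch schedules.
schedule_a = assimilate(1.0, 0.25, [[1.4], [0.9, 1.1]], noise_var=0.25)
schedule_b = assimilate(1.0, 0.25, [[1.4, 0.9], [1.1]], noise_var=0.25)
print(schedule_a, schedule_b)
```

In this linear Gaussian setting the final result does not depend on how the data are grouped into batches, which is the property that makes sequential updating attractive; spatial, multivariate or non-Gaussian settings require considerably more machinery.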

### **33.5 Decision Making**

Geometallurgy touches on all levels of optimization of the mining operation, from exploration, investment, and strategic mine planning towards the daily operation. Each optimization task can be stated as a *w*-question, and delimits a certain scope of the decision to be taken.

Blending ores from different localities to ensure stable feed properties for the plant presents the smallest decision scope, as it only changes *where* to mine and not *when* or *how to process*. Having the ability to predict mining and processing behaviour for different feed materials makes it possible to better predict block values or machine time and maintenance requirements. Such better block values can be used in classical strategic mine planning tools for an optimal exploitation of the deposit, that is, answering the *when* and *where* issues related to pushbacks and ultimate pit calculations. For this task a statistical model relating the primary geometallurgical properties with secondary ones is typically enough (Vann et al. 2011). If the processing model is good enough to predict the value as a function of the processing choices, it can be used in conjunction with a geostatistical description of the geometallurgical ore properties to optimize the processing itself either for the whole deposit or each block (optimal adaptive processing) (Turner-Saad 2011; Tolosana-Delgado et al. 2015). Goodfellow and Dimitrakopoulos (2017) show how blending, strategic mine planning and routing can be optimized together. The optimizability, i.e. the optimal achievable productivity, depends on very basic decisions like the size of selective mining units, available equipment and available data. The overall value of the mine and thus the decision to mine itself depends on all details. Boogaart et al. (2015) show the relevance of the selective mining unit and the decision strategy for the value of the mine (*how to model*). Boogaart et al. (2016) show how to quantify these values and the value of the available equipment, determining costs and available processing choices, before the actual mining operation starts. Such calculations are based on geostatistical simulation, and thus make it possible to optimize the geometallurgical approach (*how to optimize*) and the investment (*what to build*).
Boogaart et al. (2016) show the substantial influence of the exploration plan and the data acquisition strategy (e.g. the influence of processing data) on the overall value of the operation and how quantifying the value of information can be used to optimize the geometallurgical exploration strategy. This offers a way to economically justify and plan in good time extensive geometallurgical data acquisition campaigns (*what and when to measure*).
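A toy Python example may help to see why such decisions should be based on simulated realisations rather than on a single estimated grade. Everything in it is hypothetical: three equally likely simulated grades for one block, a grade-dependent recovery curve, a metal price and a processing cost. Because the value is non-linear in the grade, the expected value over the realisations differs from the value computed at the mean grade, and here the two even lead to opposite decisions.

```python
def block_value(grade, option, price=100.0, cost=52.0):
    """Value of treating a block of given grade with a processing option."""
    if option == "waste":
        return 0.0
    recovery = grade / (grade + 0.5)  # hypothetical grade-dependent recovery
    return price * grade * recovery - cost


def expected_value(realisations, option):
    """Average value of an option over equally likely simulated grades."""
    return sum(block_value(z, option) for z in realisations) / len(realisations)


def best_option(realisations, options=("waste", "process")):
    """Option with the highest expected value over the realisations."""
    return max(options, key=lambda option: expected_value(realisations, option))


realisations = [0.2, 0.4, 1.8]  # hypothetical simulated block grades
mean_grade = sum(realisations) / len(realisations)

simulation_choice = best_option(realisations)
plug_in_choice = max(("waste", "process"), key=lambda o: block_value(mean_grade, o))
print(simulation_choice, plug_in_choice)
```

The same pattern extends to several processing options per block and to comparing expected values with and without additional data, which is the basis of value-of-information calculations.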

All these approaches rely on stochastic optimisation in a geostatistical framework for geometallurgical data combined with a geometallurgical processing model, both based on quantitative ore characterisations. That is, they rely on the mathematical tools described in the preceding three sections. Applying these techniques is still a major geoinformational challenge including big data management, data fusion, massive parallel computing and real time data management (Jones and Moorhead 2013; Lopez et al. 2016).

### **33.6 Conclusions**

Geometallurgy requires substantial geomathematical developments in all the classical fields of mathematical geosciences and geoinformatics. The challenges are beyond the classical solutions, e.g. a truly multivariate, multi-scale Geostatistics honoring non-Gaussian relationships is required; statistical analysis for various scales beyond positive data and compositions is required, in particular distributional data; a full space-time 3D data fusion and fast automated updating of models will be required; there are new challenges to the mathematical background of metrology including issues of local analytics, compositional calibration, and varying material matrices; structural characterisation on several scales from the ore body to the microfabric is needed on a quantitative level from limited 2D stereological data and supportive conditioning information (bulk mineralogy and geochemistry, accessory information on mineral stoichiometry, crystallographic defects, etc.); geostatistical models of the spatial variation of the microstructure throughout the deposits (i.e. a structure Geostatistics) need to be developed; and so on.

The mathematical challenges of integrating characterisation, stochastic modelling, process simulation and optimisation, and data reconciliation, will extend to manmade and secondary resources (tailing dams, recycling, urban mining) and to the optimisation of other geosystems (water management, ecosystem management, urban ecosystems, the trisystem of energy-minerals-water), hence the lessons learnt from primary ores geometallurgy will be relevant for many fields beyond ore geology and mining. Beyond the classical fields of mathematical geosciences, geometallurgical questions will as well require solutions from mathematical disciplines uncommon at the IAMG, like optimisation, operations research and numerical process modelling. Thus, geometallurgy extends the scope of the IAMG towards these fields. In this way geometallurgy can become the scientific and economic driving force for the next generation of mathematical geosciences and geoinformatics.

### **References**

AusIMM (2011) First AusIMM international geometallurgical conference. AusIMM

AusIMM (2013) Second AusIMM international geometallurgical conference. AusIMM

AusIMM (2016) Third AusIMM international geometallurgical conference. AusIMM



### **Chapter 34 Data Science for Geoscience: Leveraging Mathematical Geosciences with Semantics and Open Data**

**Xiaogang Ma**

**Abstract** Mathematical geosciences are now in an intelligent stage. The new data environment enabled by the Semantic Web and Open Data poses both new challenges and opportunities for the conduct of geomathematical research. As an interdisciplinary domain, mathematical geosciences share many topics in common with data science. Facing the new data environment, will data science inject new blood into mathematical geosciences, and can data science benefit from the achievements and experiences of mathematical geosciences? This chapter presents a perspective on these questions and introduces a few recent case studies on data management and data analysis in the geosciences.

### **34.1 Introduction**

The global science community is facing a data environment that has never existed before. New generations of sensors, instruments and platforms extend the range of exploration and speed up the frequency of data collection. Rapid advances in data storage facilities make it possible to archive and retrieve massive datasets in digital formats. The wide coverage of Internet and World Wide Web services allows researchers to share datasets and communicate with colleagues efficiently both in the office and from the field. As transparency, openness and reproducibility of research results and methods receive increasing attention, the science community is now promoting an open science culture (Nosek et al. 2015) and encouraging actions on open access, open data, open code and open samples (Easterbook 2014; Hey and Payne 2015; McNutt et al. 2016). In the domain of geoscience, significant progress has been achieved on open data, including data services of federal agencies such as NASA, USGS and NOAA, and community-built data portals such as OneGeology, EarthChem, RRUFF, PANGAEA, PaleoBioDB, and more.

X. Ma (✉)

Department of Computer Science, University of Idaho, 875 Perimeter Drive MS 1010, Moscow, ID 83844-1010, USA e-mail: max@uidaho.edu

<sup>©</sup> The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_34

A clear trend in open data actions is that the World Wide Web is used as the space for data storage, publication, discovery and access. Data resources on the Web provide convenience for geoscience researchers, and lay out the platform for cross-disciplinary collaboration and new scientific discoveries.

In addition to focused research topics within each discipline, geoscience researchers in the 21st century are now able to tackle grander research questions (Fig. 34.1) that need broad perspectives, multidisciplinary collaboration and sustained data support. Studies on these questions will extend our fundamental knowledge and understanding of the Earth system, which in turn will contribute to the application of geoscience in tackling social and economic issues that are relevant to human welfare. For example, Future Earth, a ten-year initiative (2015–2025) coordinated by several international organizations, proposed eight key challenges to global sustainability (Future Earth 2014): water-energy-food nexus, decarbonization, natural assets, cities, rural futures, human health, consumption and production, and social resilience. To grasp these tremendous opportunities and make innovative discoveries, geoscience researchers need the necessary data resources and skills. Although geoscience data are increasingly made available online, many of them are not ready for use by end users because of their heterogeneity. This heterogeneity is reflected in the vast number of subjects, varied data structures and formats, and diverse terminologies (Berg-Cross et al. 2012; Ramachandran et al. 2006; Reitsma and Albrecht 2005). Methods and skills of both data management and data analysis are needed for conducting science within the inspiring and complex data environment of today.

Data management and data analysis are the two key concepts in data science (cf. Schutt and O'Neil 2013), which involves knowledge of library and information science, computer science, mathematics, statistics, and domain-specific disciplines. While the theoretical foundations of data science are still under development (Drineas and Huo 2016), there have already been many applications and discussions of data science in recent years (Schutt and O'Neil 2013), and a general process of data science is emerging (Fig. 34.2). The steps and processes in Fig. 34.2 would be familiar to researchers in all disciplines mentioned above, as they are comparable to the widely-adopted hypothesis-driven research method in modern science. Nevertheless, many questions remain to be asked as we are now in the "inspiring and complex data environment": Do we have methods and techniques to improve the efficiency of each step? How can we create a space and design an approach where researchers from the different disciplines can collaborate and leverage their individual capabilities to achieve a focused objective? What are the distinctive features of data science in a domain-specific context, including geoscience?

**Fig. 34.1** The 10 grand research questions for the 21st century Earth sciences (National Research Council 2008)

**Fig. 34.2** Primary steps in a data science process. From Schutt and O'Neil (2013) with changes

Researchers of mathematical geosciences or geomathematics can have a lot to say about their experience and understanding of data science, because mathematical geosciences is a domain with a long history of incorporating knowledge from computer science, mathematics and statistics with geoscience (Agterberg 2014; Bonham-Carter 1994; Loudon 2000; Merriam 2004). Will the latest research progress of data science inject some new blood into the mathematical geosciences; and vice versa, can the methods and experiences in mathematical geosciences contribute to the theoretical developments of data science? The purpose of this chapter is to present a perspective on these questions based on a review of the evolution of mathematical geosciences and a summary of the latest discussions of data science within the geoscience community. To support the presented perspective, a few recent case studies will be introduced in the second half of the chapter, with a focus on how data science can help leverage the existing capabilities in geoscience research and achieve new goals.

### **34.2 The Intelligent Stage of Mathematical Geosciences**

### *34.2.1 Evolution of Mathematical Geosciences*

Retrospection on the evolution of mathematical geosciences will help us understand the characteristics of this discipline as well as the opportunities it faces today. In an informative review, Merriam (2004) summarized the six stages in the development of quantitative geology: Origins (1650–1833), Formative (1833–1895), Exploration (1895–1941), Development (1941–1958), Automated (1958–1982), and Integration (1982–). The three earlier stages, over a period of almost 300 years, made use of various developments in both geoscience and mathematics, and more importantly the co-evolution between them. The latter three stages were characterized by the application of computers, first in geostatistics, simulation and modeling, and the organization of large datasets and later in all aspects of the geoscience workflow, including data capture, manipulation, analysis and documentation. Merriam (2004) also briefly mentioned the Internet and the potential challenges and opportunities in the connected virtual world, and he stated, "There is seemingly no limit to the information and communication revolution."

Indeed, today, just about 12 years after Merriam's review paper, geomathematical researchers as well as the broad geoscience community already face the new data environment. We now have new instruments for measurement and observation, powerful facilities for data storage and transmission, improved interoperability of online datasets, and effective algorithms for data processing and analysis. New methods and technologies such as big data, open data, machine learning, data mining, data science, semantic web and natural language processing have been increasingly used in geoscience studies. The functionality of computers is being leveraged to a new level, where they are not only capable of representing "what is" known but can also show us "why" and help generate ideas on "how to" explore new findings. Ma (2015) proposed that mathematical geosciences is now in an Intelligent stage (2014–). Besides these accelerated developments and applications of geomathematical methods within the geoscience disciplines, there are growing needs for using these methods in cross-disciplinary programs to address socio-economic issues that are of public concern (Freeden 2010).

In this intelligent stage, what can we do to leverage mathematical geosciences in various multidisciplinary studies? In this chapter, the author wants to address the need to refresh our knowledge about the latest progress in open data and data science. For geoscience researchers, especially those who are not familiar with data science, knowing open data will be a key to understanding the general data science process and some featured works using datasets retrieved from the Web.

### *34.2.2 Characteristics of Open Data and Semantic Web*

Most geoscience studies are driven by data. The term "open data" reflects people's desire for access to freely available datasets. Some open data are made accessible under specified licenses and copyrights, and others carry no limits or restrictions. The popularity of the Internet and the Web creates a wide space for the implementation of open data. For end users of open data, an issue of extreme concern is data interoperability (Fig. 34.3). Researchers have discussed the levels of data interoperability from different perspectives. The levels in the center of Fig. 34.3 (Brodaric 2007) are from a technical point of view. The Systems level is fundamental: there should be the necessary protocols (e.g. TCP/IP for the Internet and HTTP for the Web) supporting data discovery and transmission. The Syntax and Schematics levels concern the data structures and models, which an end user should be able to parse and analyze. The Semantics level indicates that the meaning of data, as reflected in the data model, terminology and encoding, is made readable to machines and thus understandable to users. The Pragmatics level means the data are suitable for the user's purpose and can contribute value in applications. The right part of Fig. 34.3 (Ma et al. 2011) explains these technical levels in layman's language, and it also adds that all the technologies and implementations at those levels should be legal and ethical from a social science point of view.

The Semantic Web (Berners-Lee 2000) provides technological support to each level of data interoperability (Fig. 34.3). For geoscience researchers, the Semantic Web creates a space where datasets can be more efficiently annotated, published, discovered and accessed. The Semantic Web is an extension to the current World Wide Web (Berners-Lee et al. 2001). The Web is now in the transition from a Web of Documents to a Web of Data because of the embedded structures and meanings that did not exist before. Nevertheless, to add structure and meaning to the information on the Web, definitions and representations of concepts and the interrelationships among concepts are needed (Berners-Lee 2006). In the Semantic Web such definitions and representations are called ontologies. Each ontology is the formal specification of the shared conceptualization of a domain of study (Gruber 1995). In practice, ontologies can take different forms, such as glossaries, controlled vocabularies, conceptual schemas and detailed logic constraints, depending on the level of detail of the conceptual specification. Semantic Web technologies provide the essential elements for modeling and encoding ontologies in machine-readable formats.

**Fig. 34.3** Levels of data interoperability and a comparison with the architecture of the Semantic Web. From Berners-Lee (2000), Brodaric (2007) and Ma et al. (2011)

In a cross-disciplinary program with datasets from various sources and subjects and researchers from different knowledge domains, there could be a large number of ontologies addressing the various needs of knowledge engineering and concept representation. Those ontologies can be used to build innovative functions that support the discoverability, accessibility, understandability and usability of open data. For example, there can be projects on categorizing datasets and publications based on their subjects and keywords, recommending datasets or publications to users based on their research interests, suggesting matches between datasets and scientific questions, and more. The data science domain has also recently proposed the topic of "smart data" (Sheth 2014), which aims at using Semantic Web technologies to improve the efficiency of transforming massive datasets into actionable information.

### *34.2.3 Methodology of Deploying Data Science in Geoscience*

Although data science has already attracted significant attention in both academia and the industry, the theoretical foundations and technological systems of data science are still under development. In the summary report of a recent NSF-funded workshop (Drineas and Huo 2016), the emergence of data science as a discipline was compared to the rise of computer science in the 1950s along with the wide availability of computers, especially personal computers (PCs). The data deluge of today and its great potential for academia and industry are, in the report authors' language, a "forcing function" that will catalyze the emergence of data science departments in universities and nurture the development of data science as a discipline. At the current time, since we do not have established theoretical foundations for data science, we can understand the core of data science as a cross-disciplinary topic, or a blend of massive datasets with methodologies in existing disciplines, such as computer science, library and information science, statistics and mathematics. The application of data science will further extend the coverage of disciplines to other domains, such as geoscience.

In most scientific research, including that in geoscience, a general research process includes the following steps: (1) Choose a general direction and do background research; (2) Generate a hypothesis; (3) Conduct experiments and collect data; (4) Analyze data and revise the hypothesis; (5) Communicate results. We can compare those steps with the data science process in Fig. 34.2. Both processes follow a direction of data collection, data analysis and result communication, but a few items are worthy of further discussion. First, data science often faces a situation in which massive datasets already exist while we do not yet have a hypothesis. Second, the data science process includes a step called data pre-processing, which detects the inconsistent, incomplete and incorrect parts of the datasets and takes actions to ensure data quality before analysis. Data pre-processing is an essential step for large datasets collected from multiple sources. Third, the step of exploratory data analysis (EDA) offers clues for hypotheses in scientific research. EDA is a widely used approach in statistics, and it covers many methods, such as the scatterplot, box plot, residual plot, smoother, bag plot, and more (Brillinger 2011). The term "exploratory" explains the purpose of the method: it is flexible and can help look for things that we believe are not there as well as those we believe to be there (Tukey 1977). EDA helps address the shortage of research hypotheses for massive data that already exist. The functionality of EDA is comparable to the approach of data-driven abductive discovery (Hazen 2014). Abduction means the formation of a plausible explanation for an observation. Charles S. Peirce (1839–1914) viewed abduction as the first stage of scientific reasoning, i.e. to create a hypothesis. Deduction is then carried out to determine the specific evidence needed to prove the hypothesis. After that, induction is used to extrapolate a general rule or principle from the findings. Hazen (2014) summarized that abduction is to discover what we do not know we do not know, while deduction and induction are to discover what we know we do not know. This is comparable to Tukey's point of view on EDA (Tukey 1977).
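
As a small illustration of how EDA can surface a candidate hypothesis before one has been stated, the sketch below computes a box-plot summary and flags outliers with the common 1.5 × IQR fence rule. The sample values, the function name and the choice of fence rule are assumptions made for illustration; they do not come from the studies cited above.

```python
import statistics


def box_plot_summary(values):
    """Return the five-number summary plus values outside the 1.5 * IQR fences."""
    data = sorted(values)
    # Quartiles by linear interpolation over the observed range.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "min": data[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": data[-1],
        "outliers": outliers,
    }


# Hypothetical trace-element concentrations (ppm) compiled from several sources.
samples = [12.0, 14.5, 13.2, 15.1, 12.8, 13.9, 14.2, 48.7, 13.5, 14.8]
summary = box_plot_summary(samples)
print(summary)
```

A flagged value such as the 48.7 ppm sample is not an answer in itself. It prompts the next question, for example whether it reflects a measurement error to be handled in pre-processing or a real anomaly worth a hypothesis.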

One of the most significant challenges in deploying data science in geoscience is to create a space (physical and/or virtual) and establish an approach so that researchers from different disciplines can talk to each other. Science today is highly compartmentalized into disciplines, and there are considerable gaps between them, as reflected by differences in scientific subjects, research methods, terminologies and even working styles. The challenge of cross-disciplinary collaboration is like encouraging people to step out of their "comfort zones". Researchers in geoinformatics (Fox and McGuinness 2008; Ma et al. 2014b) have proposed a method called the use case-driven iterative approach, and have successfully implemented it to facilitate collaboration between data scientists and domain scientists in several projects. Each use case is a description of the process of a focused task. It can be used to identify the scientific questions to ask, the resources to be used to answer these questions and the methods to be implemented to determine the answer. Through the documentation and analysis of a use case, data scientists and domain scientists (e.g. geologists) can understand each other's needs and aims. As each use case is a small, focused task, the collaborative team can achieve the goal in a relatively short time, and then review, update and move on to the next use case. The process iterates until the overall objective of a research program is realized.

### **34.3 Case Studies of Data Science in Geoscience**

When applying data science to leverage current geoscience studies, the focus can be on one or a few steps, depending on the target. For example, the target can be improving data discoverability and accessibility by updating building blocks and frameworks in the cyberinfrastructure. It can also be finding patterns within massive datasets such as those from legacy literature or crowd-sourcing databases. In this section, a few recent efforts and case studies are introduced.

### *34.3.1 Coordinating Standards to Improve Data Interoperability*

In the domain of geoscience, a few recent achievements in data standards and their implementation were led by CGI-IUGS (http://www.cgi-iugs.org), the Commission for the Management and Application of Geoscience Information of the International Union of Geological Sciences. GeoSciML was proposed as a markup language for the exchange of general geoscience information on the Web (Sen and Duffy 2005). GeoSciML was built on top of the Geography Markup Language (GML) and the eXploration and Mining Markup Language (XMML). The first geoscience subjects covered in GeoSciML included boreholes and structural geology. Raw datasets such as those in geologic maps can be transformed into GeoSciML formats once the mapping between the original data structure and the GeoSciML schema is set up. This makes data exchange and sharing among organizations and nations easier. GeoSciML was successfully implemented in the OneGeology project (Jackson and Wyborn 2008). On the front end of the OneGeology data portal (http://portal.onegeology.org), users can access geologic map services in a standard data structure. At the back end of the portal, there are multiple data providers, distributed data servers and different data structures. GeoSciML acts as a mediator between those heterogeneous structures and improves data interoperability. Another significant contribution from CGI-IUGS is its multi-lingual geoscience vocabularies. Initial projects on geologic time and rock type vocabularies were applied in the OneGeology-Europe project to harmonize geologic maps from around 20 European countries (Laxton et al. 2010). Standards derived from those vocabularies also became a part of INSPIRE, the Infrastructure for Spatial Information in Europe (http://inspire.jrc.ec.europa.eu).
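
The mediator role described above can be sketched in a few lines: once a mapping from each provider's native fields to a shared structure is set up, heterogeneous records can be served in one form. The provider names, field names and records below are hypothetical, and the shared field names are simplified placeholders rather than the actual GeoSciML schema.

```python
# Hypothetical mappings: each provider's native field names -> shared field names.
# The shared names are illustrative only and are NOT the GeoSciML schema.
PROVIDER_MAPPINGS = {
    "survey_a": {"unit_nm": "unit_name", "lith": "lithology", "age_txt": "geologic_age"},
    "survey_b": {"Formation": "unit_name", "RockType": "lithology", "Period": "geologic_age"},
}


def to_shared_schema(provider, record):
    """Translate one provider-specific record into the shared structure."""
    mapping = PROVIDER_MAPPINGS[provider]
    missing = [src for src in mapping if src not in record]
    if missing:
        raise KeyError(f"{provider} record lacks fields: {missing}")
    return {shared: record[src] for src, shared in mapping.items()}


# Invented example records from two providers with different data structures.
raw_a = {"unit_nm": "Unit A1", "lith": "sandstone", "age_txt": "Jurassic"}
raw_b = {"Formation": "Unit B7", "RockType": "mudstone", "Period": "Triassic"}

harmonized = [
    to_shared_schema("survey_a", raw_a),
    to_shared_schema("survey_b", raw_b),
]
print(harmonized)
```

In this sketch the mapping tables carry all provider-specific knowledge, so adding a new provider means adding one mapping rather than changing the front end.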

Such efforts on data standards are an essential part of informatics, especially applied informatics that has a domain-specific background. Compared with the geoscience community at large, the number of people working on geoinformatics is low. The value and gains that data standard work can provide are often not fully understood within the geoscience community (Jackson and Wyborn 2008). The situation has been changing in recent years as the value of data science has been recognized by more and more geoscience researchers. For instance, besides GeoSciML, CGI-IUGS has also developed EarthResourceML for the exchange of information on mineral occurrences, mines and mining activity. CGI-IUGS's Terminology Working Group has published additional standardized vocabularies. The geoscience community has also collaborated with standards organizations to improve the visibility of data standard outputs. In 2017, GeoSciML was published as a standard of the Open Geospatial Consortium (OGC) (OGC 2017), making it one of the first domain-specific standards in OGC. Geoinformatics researchers also take the lead in coordinating data standards among different scientific disciplines. In 2016, CODATA, the International Council for Science's Committee on Data for Science and Technology, set up a task group on coordinating data standards amongst scientific unions (http://www.codata.org/task-groups/coordinating-datastandards). The aim of the group is to take stock of the progress on disciplinary data standards in different scientific unions, recognize the best practices and coordinate the development of future work. Data standards provide the basic-level technical support when we collect and analyze datasets in cross-disciplinary projects. They significantly reduce the workload of data pre-processing and data cleansing in a data science process (Fig. 34.2).

### *34.3.2 Openness, Provenance and Reproducibility of Research*

Provenance and reproducibility are both regarded as important research topics in data science (Drineas and Huo 2016), and they are also essential parts of open science. The literal meaning of provenance is the origin of something. In data science, documenting provenance involves the annotation and interconnection of a network of research activities, people, organizations and resources involved in the production of scientific findings (Ma et al. 2014a). In 2013, the Semantic Web community released an ontology called PROV-O (Lebo et al. 2013). The three top classes in PROV-O, Entity, Activity and Agent, are easy to understand. The ontology also covers a list of subclasses and relationships that can be applied in domain-specific applications. A recent successful implementation of PROV-O is the Global Change Information System (GCIS) (Tilmes et al. 2013), which is part of the U.S. Global Change Research Program (USGCRP, http://www.globalchange.gov). USGCRP is a multi-agency research program to "assist the Nation and the world to understand, assess, predict, and respond to human-induced and natural processes of global change." Every four or five years, USGCRP releases a National Climate Assessment Report with the latest scientific findings on different aspects of global change. The most recent one was released in 2014. The initial aim of GCIS is to present the 2014 report and to provide integrated access to the interlinked resources underpinning that report. The long-term goal of GCIS is to be a web-based source of authoritative, accessible, usable and timely information about global change. Semantic Web technologies, including PROV-O, were applied in the design and development of GCIS. The project included four major parts: categorization, annotation, identification and linking (Ma et al. 2014a), which are consistent with the architecture of the Semantic Web (Berners-Lee 2000). With the well-documented provenance information on the GCIS website (https://data.globalchange.gov), users will be able to conduct innovative research on provenance tracing and data mining. For example, they can seek answers to the question: What is NASA's contribution to the sea-level rise scenarios in the 2014 National Climate Assessment Report?
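
To make provenance tracing concrete, the following minimal sketch stores a toy provenance graph as subject-property-object triples and walks it to collect the agents behind a report section. The property names are standard PROV-O terms, but every resource identifier is hypothetical and the traversal is an illustrative assumption, not the GCIS implementation.

```python
# Toy provenance graph as (subject, property, object) triples.
# Property names follow PROV-O; all resource identifiers are hypothetical.
TRIPLES = [
    ("report:sea-level-section", "prov:wasDerivedFrom", "figure:scenarios"),
    ("figure:scenarios", "prov:wasGeneratedBy", "activity:scenario-analysis"),
    ("activity:scenario-analysis", "prov:used", "dataset:sea-level-obs"),
    ("activity:scenario-analysis", "prov:wasAssociatedWith", "org:agency-x"),
    ("dataset:sea-level-obs", "prov:wasAttributedTo", "org:agency-y"),
]

# Properties whose object is an Agent rather than another Entity or Activity.
AGENT_LINKS = {"prov:wasAssociatedWith", "prov:wasAttributedTo"}


def contributing_agents(start):
    """Follow provenance links from `start` and collect every agent reached."""
    agents, seen, stack = set(), set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for subject, prop, obj in TRIPLES:
            if subject != node:
                continue
            if prop in AGENT_LINKS:
                agents.add(obj)
            else:
                stack.append(obj)  # keep tracing through entities and activities
    return agents


print(sorted(contributing_agents("report:sea-level-section")))
```

A production system would more likely hold such triples in an RDF store and answer the question with a query language; the plain-Python traversal here only shows the shape of the reasoning.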

Reproducibility in data science and open science includes at least two levels of meaning. The first is replicability of a research output by using the datasets and methods in the research. The second is the derived value, which means the open datasets and methods from that research can be reused in new research and make substantial contributions (Beaulieu et al. 2017). To improve the reproducibility of scientific research, several technical frameworks can be applied and/or adapted, such as workflow platforms and provenance documentation. In a recent study on reproducible marine ecosystem assessment (Ma et al. 2017), the PROV-O ontology was extended and implemented in the Jupyter Notebook (http://jupyter.org) to capture and interconnect information from various resources in a scientific research project. Jupyter Notebook is an open-source web application that can be used to create workflow documents with code, formulas, tables, diagrams, interactive visualizations and descriptive text. The developed ontology further enhanced the function of the platform in capturing and presenting scientific provenance information. The work was used in the Ecosystem Assessment Program of the U.S. NOAA Northeast Fisheries Science Center to support assessment reports of Large Marine Ecosystems. In the implementation, a user works within the Jupyter Notebook to write code and text for data input, analysis, output and documentation. Once the notebook is completed, the provenance information is automatically captured using the structure defined in the ontology. The collected provenance information is machine-readable and can be archived for later use, such as verifying steps and outputs in the workflow or retrieving the raw datasets used in any given step.
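
One simple way to picture automatic provenance capture inside a workflow is a decorator that records, for each step, what it used and what it generated. This is only a sketch of the idea under stated assumptions: the decorator, the step functions, the file name and the numbers are all hypothetical, and the approach is not the mechanism implemented in the cited study.

```python
import functools

PROVENANCE_LOG = []  # one machine-readable record per executed workflow step


def track(step):
    """Wrap a workflow step so each call is logged as an activity."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        result = step(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": step.__name__,
            "used": [repr(a) for a in args] + [f"{k}={v!r}" for k, v in kwargs.items()],
            "generated": repr(result),
        })
        return result
    return wrapper


@track
def load_catch(path):
    # Stand-in for reading a file; returns invented values for illustration.
    return [120, 95, 143]


@track
def mean_catch(values):
    return sum(values) / len(values)


catch = load_catch("catch_example.csv")  # hypothetical file name
average = mean_catch(catch)
for record in PROVENANCE_LOG:
    print(record)
```

Because the log links each output to its inputs, a later reader can verify a step or locate the raw data behind it, which is the kind of reuse the paragraph above describes.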

### *34.3.3 Leveraging Geoscience Data Legacy for New Discovery*

Geoscience is a domain with abundant literature resources, and much useful information can be extracted from this data legacy. A recent project, originally called PaleoDeepDive (Peters et al. 2014) and now GeoDeepDive (https://geodeepdive.org), has demonstrated the significant value of geoscience publication archives through the application of machine learning and data mining technologies. The domain of focus in GeoDeepDive is paleontology, and its aim is to detect and extract fossil occurrence information from the massive scientific literature. The work leverages methods in natural language processing, entity recognition and extraction, and knowledge graph construction to improve the efficiency of document processing and the quality of output datasets. In several complicated data extraction and reasoning tasks, the outputs of GeoDeepDive were comparable to the results collected by human experts in geologic history (Peters et al. 2014). Most recently, several publishers and research organizations have set up partnerships with GeoDeepDive and provided a huge number of publications for processing. By mid-April 2017, the team had already processed more than 3.2 million documents. The extracted fossil records and their interrelationships can provide useful updates to existing databases, such as the Paleobiology Database (PBDB, https://paleobiodb.org/). PBDB, in turn, has set up interfaces and libraries such as those for Web-based data query and retrieval (Peters and McClennen 2015) and the R environment (Varela et al. 2015). These projects build channels through which any geoscience researcher can easily access datasets of interest and integrate them with other datasets in their own projects.

An ongoing project in the author's group uses an ontology to help integrate datasets from PBDB with geologic map services provided by USGS and, thus, to build an enriched data portal where users can discover and access more information. Previous work has already shown the functionality of ontologies and data visualization in geoscience data services (Ma et al. 2012). In the ongoing project the focus is an ontology for the regional geologic time scale of North America, in addition to the established ontology for the global geologic time scale (Cox and Richard 2015). The geologic time scale of North America has unique classification and terminology for the time intervals at the Epoch and Age levels; for the levels of Eon, Era and Period it shares the architecture of the global standard. As the terminology in the regional standard has been used in geoscience research of the North American region, specific terms in the regional standard can now also be used as keywords in data search, such as in queries sent to PBDB. In the ontology for the regional geologic time scale of North America, detailed information on all time intervals and their relationships was captured and represented in a machine-readable format. A Web-based visualization was then developed for the ontology, and interactive functions were developed to deploy the visualization as a control panel for data search. When a user clicks a time term in the panel, a query is sent to PBDB, and the fossil records retrieved from PBDB are plotted in a map window. Our project also set up connections to the USGS data services, so the user can load geologic map layers onto the map window and browse the background geologic information of a location where a fossil was discovered. The multi-source information has the potential to stimulate discussion among users and help them propose new research questions.
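
The step from a clicked time term to a database query can be sketched as building a request URL. The example assumes the PBDB data service version 1.2 occurrence endpoint and its `interval`, `base_name` and `show` parameters; the helper name and the chosen interval and taxon are illustrative, and whether a given regional term is recognized by the service should be checked against the PBDB documentation.

```python
from urllib.parse import urlencode

# Occurrence listing endpoint of the PBDB data service (assumed version 1.2).
PBDB_OCCURRENCES = "https://paleobiodb.org/data1.2/occs/list.json"


def occurrence_query_url(interval, base_name=None):
    """Build a query URL for fossil occurrences within a named time interval."""
    params = {"interval": interval, "show": "coords"}  # coords for map plotting
    if base_name:
        params["base_name"] = base_name  # optionally restrict to one taxon
    return f"{PBDB_OCCURRENCES}?{urlencode(params)}"


# Example: the user clicks a regional time term in the control panel.
url = occurrence_query_url("Wasatchian", base_name="Mammalia")
print(url)
```

The sketch stops at the URL so that it runs without network access; in a portal, the response would be fetched and the returned coordinates plotted in the map window.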

### *34.3.4 Cross-Disciplinary Collaboration for Innovative Discoveries*

In early 2015, a research project focused on the co-evolution of geo- and biospheres was launched at the Carnegie Institution of Washington (http://dtdi.carnegiescience.edu). The researchers in that project come from several universities and institutions and have diverse knowledge backgrounds, making the research a real cross-disciplinary collaboration. The project proposed to deploy a data-driven abductive approach to discover patterns in the evolution of Earth's environment. A major task in the early stage of the project is to set up a Deep-Time Data Infrastructure (DTDI), which includes the enrichment of attributes (e.g. age information) in existing geo- and bio-databases, connections among geo-databases of petrology, mineralogy and geochemistry, the linkage between geo- and bio-databases, and open access and dissemination protocols for the built data infrastructure. Many open access data resources were considered for DTDI, including rruff.info (mineral species and properties), mindat.org (mineral species and localities), earthref.org (geochemistry and geomagnetism), geokem.com (igneous rock chemistry), metpetdb.rpi.edu (metamorphic petrology), earthchem.org (geochemistry, geochronology, petrology), vamps.mbl.edu (subsurface microbial ecosystem), pdb.org (protein structures), paleobiodb.org (paleobiology), and more. The use case-driven iterative method mentioned in Sect. 34.2.3 has been implemented to organize meetings and promote collaboration among researchers in the group. While the project is still ongoing, several interesting findings have already been achieved. One of them is the pattern of Large Number of Rare Events (LNRE) in the mineral species frequency distribution (Hystad et al. 2015). The work used the records of mineral species, localities and observations (species-locality pairs) from mindat.org and discovered the LNRE pattern. By extrapolating the domain of observation to about four times the current size, the LNRE model indicated that about 1,500 new mineral species remain to be discovered. From that work, further studies on the population probabilities of all mineral species led to the characterization of Earth-like planets, such as Mars (Hystad et al. 2017).
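
The starting point for an LNRE-style analysis is a frequency spectrum: how many species are known from exactly one locality, from exactly two, and so on. The sketch below derives that spectrum from species-locality pairs. The pairs are invented for illustration, and the code covers only this counting step, not the statistical model fitting or extrapolation reported in the cited work.

```python
from collections import Counter


def frequency_spectrum(pairs):
    """Map m -> number of species observed at exactly m distinct localities."""
    # Deduplicate so a repeated species-locality pair is counted once.
    localities_per_species = Counter(species for species, _ in set(pairs))
    spectrum = Counter(localities_per_species.values())
    return dict(sorted(spectrum.items()))


# Invented species-locality observations (one pair is deliberately duplicated).
pairs = [
    ("species_a", "L1"), ("species_a", "L2"), ("species_a", "L3"),
    ("species_b", "L1"), ("species_b", "L2"),
    ("species_c", "L4"),
    ("species_d", "L5"),
    ("species_a", "L1"),
]
spectrum = frequency_spectrum(pairs)
print(spectrum)
```

A spectrum dominated by species seen at only one or two localities is the signature of many rare events, which is what motivates extrapolating how many species have not yet been observed at all.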

As an affiliated scientist in the project mentioned above, the author led a project using data visualization to study the co-relationships between mineral-forming elements and mineral species. The first study focused on a list of 30 key elements chosen by the research team (Ma et al. 2016). First, we built a 30 × 30 × 30 matrix and visualized it in a three-dimensional coordinate system, which made the matrix a fundamental framework to fill with records. Along each axis of this matrix we plotted the same ordered list of 30 elements as indices. Each cell in the matrix was first filled with the raw number of minerals in which elements X, Y, and Z coexist. A color spectrum was then applied to render each cell according to the value in it. The process was intuitive, and the output in the three-dimensional matrix already showed interesting patterns in the co-relationships between elements and minerals. The visualized matrix was developed to be interactive in a web browser. Researchers can rotate the matrix and zoom in to see details of a part, highlight a certain cell and see its attributes, and slice one or more planes out of the matrix to see two-dimensional patterns. In another study, we extended the scale to all 72 mineral-forming elements and constructed a 72 × 72 × 72 matrix. We then applied a chi-squared test to generate the values to be filled and visualized in that matrix (Hummer et al. 2016). The mineralogical research question in that study was "Does the presence of element Z affect the correlation between elements X and Y in mineral species, and is the effect positive or negative?" Besides the completed case studies, many other interesting projects can be further developed with the three-dimensional matrix. For example, we can add data on electronegativity, ionic radius, atomic number, period, crustal abundance, etc. as associated parameters to each axis and test for different clusterings of elements based on those parameters.
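
The first filling step of such a matrix, counting the minerals in which elements X, Y and Z coexist, can be sketched as follows. The four-element list and the small mineral table are a toy subset chosen for illustration, and the nested-list layout is an assumption; the cited studies used 30 and 72 elements and their own implementation.

```python
from itertools import product


def cooccurrence_cube(elements, minerals):
    """Count, for every (X, Y, Z), the minerals containing all three elements."""
    index = {element: i for i, element in enumerate(elements)}
    n = len(elements)
    cube = [[[0] * n for _ in range(n)] for _ in range(n)]
    for composition in minerals.values():
        present = [index[e] for e in composition if e in index]
        # Every ordered triple of present elements gains one count.
        for x, y, z in product(present, repeat=3):
            cube[x][y][z] += 1
    return cube


# Toy subset: the same ordered element list indexes all three axes.
elements = ["Si", "O", "Fe", "S"]
minerals = {
    "quartz": {"Si", "O"},
    "fayalite": {"Fe", "Si", "O"},
    "pyrite": {"Fe", "S"},
    "hematite": {"Fe", "O"},
}
cube = cooccurrence_cube(elements, minerals)
print(cube[0][1][2])  # minerals containing Si, O and Fe together
```

Cells on the main diagonal reduce to the number of minerals containing a single element, and cells with two equal indices give pairwise counts, so one cube holds the one-, two- and three-element views at once.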

### **34.4 Concluding Remarks**

Mathematical geosciences are now in an intelligent stage. As a research domain, mathematical geosciences share many topics with the data science of today. A topic of great interest in deploying data science for geoscience is how to generate research questions or hypotheses when massive datasets already exist. In this chapter, the role of exploratory data analysis was analyzed for that purpose, and it was compared with the data-driven abductive approach. The Semantic Web and Open Data create a fresh data environment for conducting geomathematical studies. The Web is built as an open space where Anyone can say Anything on Any topic. The Semantic Web aims to facilitate data Interoperability on the Web, to improve Interactivity between humans and machines, and to inspire Intercreativity for exploring new things. For informatics, a major objective is to present the Right information to the Right person in the Right way. We can use the acronym AIR3 to represent those nine words with initial capital letters. AIR3 presents a broad vision of deploying data science for geoscience in the context of the Semantic Web and Open Data. To put this into practice, we need to create a physical and/or virtual space and implement an approach where researchers from different disciplines can step out of their "comfort zones", talk to each other, and collaborate on focused research topics.

**Acknowledgements** This work was partly supported by the W. M. Keck Foundation, the National Science Foundation (NSF) through the NSF Idaho EPSCoR Program (award number IIA-1301792) and the University of Idaho ORED 2017 Seed Grant Program.

### **References**


Tukey JW (1977) Exploratory data analysis. Addison-Wesley, Reading, MA, 688 pp

Varela S, González-Hernández J, Sgarbi LF, Marshall C, Uhen MD, Peters S, McClennen M (2015) paleobioDB: an R package for downloading, visualizing and processing data from the Paleobiology Database. Ecography 38(4):419–425


### **Chapter 35 Mathematical Morphology in Geosciences and GISci: An Illustrative Review**

**B. S. Daya Sagar**

**Abstract** Georges Matheron and Jean Serra of the Centre of Mathematical Morphology, Fontainebleau, founded Mathematical Morphology (MM). Since the birth of MM in the mid-1960s, its applications in a wide range of disciplines have illustrated that intuitive researchers can find varied domains in which to apply MM. This chapter provides a concise review of applications of Mathematical Morphology in Geosciences and Geographical Information Science (GISci). The motivation for this chapter stems from the fact that Mathematical Morphology is one of the better choices to deal with highly intertwined topics such as retrieval, analysis, reasoning, and simulation and modeling of terrestrial phenomena and processes. This chapter provides an illustrative review of various studies carried out by the author over a period of 25 years, related to applications of Mathematical Morphology and Fractal Geometry, in the contexts of Geosciences and GISci. Because the review is presented in a condensed manner, the reader is encouraged to refer to the cited publications for more details.

### **35.1 Introduction**

A basic understanding of many geoscientific and geoengineering problems across multiple spatial and/or temporal scales of terrestrial phenomena and processes is among the greatest challenges facing contemporary science and engineering. Many space-time models explaining phenomena and processes of terrestrial relevance were descriptive in nature. Earlier, several toy models were developed via classical mathematics to explain possible phases in dynamical behaviors of complex systems. With the advent of computers with powerful graphics facilities about three decades ago, the interplay between numerical methods (generated via classical equations explaining the behaviors of dynamical systems) and graphics was shown to exist. That progress provided the initial impetus to visualize, on graphical screens, the systems' spatial and/or temporal behaviors, which exhibit simple to complex patterns. One efficient way of understanding the dynamical behavior of many complex systems of nature, society and science is through data acquired at multiple spatial and temporal scales. Data related to terrestrial (geophysical) phenomena at spatial and temporal intervals are available in numerous formats. The utility and application of such data could be substantially enhanced through related technologies documented in edited volumes and monographs of the recent past (Sagar 2001a, b, c, d, 2005a, b, 2009, 2013; Sagar and Rao 2003; Sagar et al. 2004; Sagar and Bruce 2005; Sagar and Serra 2010; Najman et al. 2012).

B. S. Daya Sagar (✉)

Systems Science and Informatics Unit, Indian Statistical Institute-Bangalore Centre, 8th Mile, Mysore Road, RVCE PO, Bengaluru 560059, India e-mail: bsdsagar@isibang.ac.in

To understand the dynamical behavior of a phenomenon or a process, development of a good spatiotemporal model is essential. An important ingredient of such a model is well-analyzed and well-reasoned information extracted or retrieved from spatial and/or temporal data. Figure 35.1 shows a schematic illustrating the key links between the various phases where the involvement of Mathematical Morphology becomes obvious from the studies shown later in the chapter.

Mathematical Morphology, founded by Georges Matheron (1975) and Jean Serra (1982), has had great impact in various fields including Geosciences and GISci, and is one of the better choices to deal with all the key aspects mentioned above (Agterberg 2001, 2004; Serra 1982, 1988). There are numerous representative publications related to mathematical morphology, to name a few: Serra (1982, 1988), Sternberg (1986), Beucher (1990, 1999), Soille (2003), Najman and Talbot (2010), Sagar (2013). Most notably, the comment on the issue of "What do Mathematical Geoscientists Do?" made by Harbaugh (2014) includes the importance of mathematical morphology of geological features in making predictions. In this chapter we outline the successful applications of the most important concepts of mathematical morphology (Table 35.1) in the context of geosciences and Geographical Information Science (GISci).

**Fig. 35.1** Mathematical morphology applications in several phases of studies of relevance to geosciences and geographical information science

**Table 35.1** Successful applications of MM transformations in geosciences, geomorphology, GISci: major references

While perceiving the terrestrial surfaces including geophysical and geomorphic basins (e.g. using Digital Elevation Models, Digital Bathymetric Models, cloud fields, microscale rock porous media etc.) as functions, planar forms (e.g. topographic depressions, water bodies, threshold elevation regions, and hillslopes) as sets, and abstract structures (e.g. networks and watershed boundaries) as skeletons, we attempt to unravel key links for better understanding the spatiotemporal behaviors of several terrestrial and/or spatial phenomena and processes between the following coherent aspects: (i) terrestrial pattern retrieval (Sect. 35.2), (ii) terrestrial pattern analysis (Sect. 35.3), (iii) simulation and modeling (Sect. 35.4), and (iv) geocomputing, visualization, spatial reasoning and planning (Sect. 35.5).

### **35.2 Terrestrial Pattern Retrieval**

Retrieving relevant information from precisely acquired spatial-temporal data of varied types about a specific complex system is a basic prerequisite to understanding the spatial-temporal behavior of that system. Retrieval of information from available spatiotemporal data, acquired from a wide range of sources and in a variety of formats, opens new horizons to the spatial statistical and geoscience communities. We have developed original spatial algorithms based on non-linear morphological transformations for retrieval of unique geophysical networks, retrieval of mountain objects, segmentation of various geophysical objects, and pairing of geophysical spatial fields based on certain similarities (Sagar et al. 2000, 2003a, b; Sagar and Chockalingam 2004; Sathymoorthy et al. 2007; Chockalingam and Sagar 2003; Lim and Sagar 2008a, b; Lim et al. 2009, 2011; Sagar and Lim 2015a, b; Danda et al. 2016).

### *35.2.1 Mathematical Morphology in Extraction of Unique Topological Networks*

In contrast to other recent works, which have focused on extraction of channel networks via algorithms that fail to precisely extract networks from non-hilly regions (e.g. tidal regions), the algorithms we proposed can be generalized for application to both hilly (e.g. fluvial) and non-hilly (e.g. tidal) terrains, and also to pore connectivity networks. These algorithms extract multiscale geomorphologic networks by systematically decomposing elevation surfaces and/or threshold elevation regions into their abstract structures, which leads to valley and ridge connectivity networks. We proposed a framework that first decomposes a binary fractal basin into a fractal DEM, from which two unique topological connectivity networks are extracted. These networks make it possible to segment the Fractal-DEM (Fig. 35.2a) into sub-basins ranging from first to highest order (Fig. 35.2c). Results derived from a synthetic DEM (Fig. 35.2a) by applying one of these algorithms include unique topological connectivity networks similar to valley and ridge connectivity networks (Fig. 35.2b) and the hierarchically partitioned watersheds (Fig. 35.2c). We demonstrated the superiority of these stable algorithms, which can be generalized to terrestrial surfaces of both fluvial and tidal types. This

**Fig. 35.2 a** simulated fractal DEM achieved through morphological decomposition procedure, **b** loop-like ridge connectivity and loopless channel connectivity networks, and **c** subbasins

work helps to solve basic problems that algorithms meant for extraction of unique terrestrial connectivity networks have faced for over three decades.

### *35.2.2 Retrieval of Morphologically Significant Regions*

Algorithms for morphological segmentation were demonstrated on a DEM to map physiographic features such as mountains, basins, and piedmont slopes (Fig. 35.3a), and the results were compared with those of other popular approaches (Fig. 35.3b).

Further, multiscale morphological opening was employed to segment binary fractal basins (Fig. 35.4a–c) that mimic geophysical basins, and cloud fields

**Fig. 35.3** Mountain pixels are the pixels in white, the piedmont pixels are the pixels in gray, and the basin pixels are the pixels in black. **a** The results obtained using the newly developed algorithm. **b** The results obtained in Miliaresis and Argialas (1999). (From Sathymoorthy et al. 2007)

**Fig. 35.4** Morphologically significant zones decomposed from **a** Koch triadic fractal island, **b** Random Koch triadic fractal island, **c** Random Koch quadric fractal island, **d** Isolated Moderate Resolution Imaging Spectroradiometer (MODIS) cloud (cloud-1), **e** Color-coded binarized (by choosing threshold gray level value 128) cloud-1 images at three threshold-opening cycles superimposed on binarized original cloud-1 color-coded with green, and **f** boundaries of 12th, 32nd, and 100th opened cloud-1 images and thresholded original cloud-1 superimposed on the original cloud image

isolated from MODIS data into topologically prominent regions (Fig. 35.4d–f). We proposed granulometry-based segmentation of geophysical fields (e.g. DEMs, clouds, etc.) with demonstration on binary fractals of deterministic and random types (Fig. 35.4a–c), and on cloud fields (Fig. 35.4d–f) that have different compaction properties with varied cloud properties.

The approach is based on computing complexity measures of morphologically significant zones decomposed from binary fractal sets via multiscale convexity analysis. It has been demonstrated on cloud fields derived from MODIS data to better segment the regions within the cloud fields that have different compaction properties with varied cloud properties. This approach of fundamental importance can be extended to several geophysical and geomorphologic fields (e.g. DEMs, clouds, binary fractals etc.) to segment them into regions of varied topological significance.

**Fig. 35.5 a** Digital Elevation Model of size 256 × 256 pixels depicting Mount St Helens, **b**–**e** four quadrants of size 128 × 128 pixels partitioned from DEM (Fig. 35.5a) include top-left $(f^{1})$, top-right $(f^{2})$, bottom-left $(f^{3})$, and bottom-right $(f^{4})$ portions

### *35.2.3 Ranking of Best Pairs of Spatial Fields*

A new metric to quantify the degree of similarity between any two given spatial fields is proposed (Sagar and Lim 2015a, b). This metric based on morphological operations can be used for image classification, in particular hyperspectral image classification, to derive best pair(s) of spatial fields from among a large number of spatial fields available in a database. In this proposed approach to compute the ranks for every possible pair of spatial fields (grayscale images) in a database, the two major computations involved include (i) estimation of grayscale morphological distance between the source and target spatial fields, and (ii) the ratios between the areas of infima and suprema of source and target spatial fields. Using this approach, four spatial elevation fields (Fig. 35.5b–e), in other words four quadrants partitioned from Fig. 35.5a could be paired into best pair (Fig. 35.6a), medium best pair (Fig. 35.6b), and the least best pair (Fig. 35.6c).
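The second ingredient of the ranking, the ratio between the infimum and the supremum of two fields, can be sketched in a few lines. This is only an illustration under stated assumptions: the "area" of a grayscale field is taken here to be the sum of its gray values, the grayscale morphological distance of step (i) is not implemented, and the function names (`overlap_ratio`, `rank_pairs`) and the toy fields are hypothetical rather than taken from Sagar and Lim (2015a, b).

```python
from itertools import combinations

def infimum(f, g):
    """Pointwise minimum of two equally sized grayscale fields."""
    return [[min(a, b) for a, b in zip(rf, rg)] for rf, rg in zip(f, g)]

def supremum(f, g):
    """Pointwise maximum of two equally sized grayscale fields."""
    return [[max(a, b) for a, b in zip(rf, rg)] for rf, rg in zip(f, g)]

def volume(f):
    """Sum of gray values (assumed stand-in for the 'area' of a field)."""
    return sum(sum(row) for row in f)

def overlap_ratio(f, g):
    """Ratio of infimum to supremum; 1.0 means the fields are identical."""
    sup = volume(supremum(f, g))
    return volume(infimum(f, g)) / sup if sup else 1.0

def rank_pairs(fields):
    """Rank every pair of named fields from most to least similar."""
    scores = [(overlap_ratio(fields[a], fields[b]), a, b)
              for a, b in combinations(sorted(fields), 2)]
    return sorted(scores, reverse=True)

# Toy 2 x 2 fields, purely illustrative.
fields = {
    "f1": [[1, 2], [3, 4]],
    "f2": [[1, 2], [3, 5]],
    "f3": [[4, 3], [2, 1]],
}
ranked = rank_pairs(fields)
```

A ratio close to 1 indicates that the two fields nearly coincide, so sorting the ratios in decreasing order gives one plausible ordering from "best pair" to "least best pair".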

**Fig. 35.6** Three best ranked pairs of spatial elevation fields shown in Fig. 35.5b–e: **a** $(f^{1}, f^{2})$, **b** $(f^{1}, f^{3})$, and **c** $(f^{3}, f^{4})$

### **35.3 Terrestrial Pattern Analysis**

Quantitative analysis of terrestrial phenomena and processes is one of the innovative new directions of geoscientific research. Analysis of terrestrial patterns (including water bodies, valley and ridge connectivity networks, watersheds, hillslopes, mountain objects, and elevation fields) at various spatial and temporal scales is important for better understanding the dynamical behaviors of various terrestrial processes and surfaces. Over the decades, various quantitative approaches have been developed and successfully demonstrated. These include morphometric analysis of river networks, hypsometry, allometry, granulometric analysis, and geodesic spectrum based analysis. In this section, we illustrate some results obtained by applying mathematical morphology to (i) morphometric and allometric analyses of river networks and water bodies and their corresponding zones of influence, (ii) deriving scale-invariant but shape-dependent power laws, (iii) deriving basin-specific geodesic spectrum, and (iv) DEM analysis.

### *35.3.1 Morphometry and Allometry of Networks*

Towards analyzing terrestrial surfaces, we have shown unique ways to quantitatively characterize spatiotemporal terrestrial complexity via scale-invariant measures that explain the commonly shared physical mechanisms involved in terrestrial phenomena and processes. These contributions (Sagar and Rao 1995a, b, c, d; Sagar 1996, 1999a, 2000a, b, 2001a, b, c, d, 2007; Sagar et al. 1998a, b, 1999; Sagar and Tien 2004; Chockalingam and Sagar 2005; Tay et al. 2005a, b, c) highlighted the evidence of self-organization via scaling laws in networks, hierarchically decomposed subwatersheds, and water bodies and their zones of influence, which evidently belong to different universality classes. These scaling laws are in excellent agreement with geomorphologic laws such as Horton's Laws, Hurst exponents, Hack's exponent, and other power-laws given in non-geoscientific contexts. A host of allometric power-law relationships were derived that were in good accord with other established network models and real networks (Figs. 35.7, 35.8 and 35.9).

### *35.3.2 Allometry of Water Bodies and Their Zones of Influence*

Topologically, water bodies (Fig. 35.10a) are the first level topographic regions that get flooded, and as the flood level gets higher, adjacent water bodies merge. The looplike network that forms along all these merging points represents zones of influence (Fig. 35.10b) of each water body. The geometric organizations of these

**Fig. 35.7 a** An example of fourth-order channel network (nonconvex set) and **b** its convex hull. A stationary outlet is shown as a round dot in Fig. 35.7a. **c** Color-coded traveltime network pruned iteratively until it reaches the outlet and **d** color-coded union of convex hulls of networks pruned to different degrees

two phenomena are respectively sensitive and insensitive to perturbation due to exogenic processes. To demonstrate the allometric relationships of water bodies and their zones of influence, a large number of surface water bodies (irrigation tanks), situated in the floodplain region of certain rivers of India, which are retrieved from multi-date remotely sensed data were analyzed in 2-D space (Sagar et al. 1995a, b). Basic measures of these water bodies obtained by morphological analysis were employed to show fractal-length-area-perimeter relationships.

We found that these phenomena follow the universal scaling laws (Sagar et al. 2002; Sagar 2005a, b) found in other geophysical and biological contexts. Universal scaling relationships among basic measures such as area, length, diameter, volume, and information about networks are exhibited by several natural phenomena, and they help to retrieve and understand the common principles underlying the organization of these phenomena. Some of the recent findings on universal scaling relations include relationships between brain and body, length and area (or volume),

**Fig. 35.8** Networks in **a** three-sided fractal basin, **b** four-sided fractal basin, **c** five-sided fractal basin, **d** six-sided fractal basin, **e** seven-sided fractal basin, **f** eight-sided fractal basin, and **g** Nizamsagar reservoir. (From Sagar et al. 1998a, b, 1999, 2001)

**Fig. 35.9 a** sub-basins decomposed from a Hortonian F-DEM areas, and **b** corresponding main lengths

**Fig. 35.10 a** A section consisting of a large number of small water bodies traced from the floodplain region of Gosthani River and **b** zones of influence of water bodies shown in Fig. 35.10a. Different colors are used to distinguish the adjacent influence zones

size and number, size and metabolic rate. In this study, we have shown a host of universal scaling laws in surface water bodies (Fig. 35.10a) and their zones of influence (Fig. 35.10b) that have similarities with several of these relationships encountered in various fields.

### *35.3.3 Morphometry of Non-network Space: Scale Invariant but Shape-Dependent Dimension*

In subsequent works on terrestrial analysis, we argued that the universal scaling laws shown as examples in the earlier section possess limited utility in exploring possibilities to relate them with geomorphologic processes. These arguments formed the basis for alternative methods (Radhakrishna et al. 2004; Teo et al. 2004; Sagar and Chockalingam 2004; Chockalingam and Sagar 2005; Tay et al. 2005a, b, 2007). Shape and scale based indexes provided to analyze and classify non-network space (hillslopes) (Sagar and Chockalingam 2004; Chockalingam and Sagar 2005), and terrestrial surfaces (Tay et al. 2005a, b, 2007) received wide attention. These methods, which preserve the spatial and morphological variability, yield quantitative results that are scale invariant but shape dependent, and are sensitive to terrestrial surface variations. "Fractal dimension of non-network space of a catchment basin",

**Fig. 35.11 a** Apollonian space, and **b** after decomposition by means of octagon

provides an approach to show the basic distinction between topologically invariant geomorphologic basins. It introduced a morphological technique for hillslope decomposition that yields scale invariant, but shape dependent, power-laws (Fig. 35.11a, b).

Varied degrees of topographically convex regions within a catchment basin represent varied degrees of hill-slopes. The non-network space, the characterization of which we focused on in our investigations, is akin to the space obtained by subtracting the channelized portions, contributed by concave regions, from the watershed space. This non-network space is akin to the non-channelized convex region within a catchment basin. We proposed an alternative shape-dependent quantity akin to fractal dimension to characterize this non-network space (e.g. Fig. 35.12a). Towards this goal, non-network space is decomposed, in two-dimensional discrete space, into simple non-overlapping disks (NODs) of various sizes by employing mathematical morphological transformations and certain logical operations (Fig. 35.12b). Furthermore, the number of NODs of less than a threshold radius is plotted against the radius, and the shape-dependent fractal dimension of the non-network space is computed from this relationship. This study was extended to derive shape dependent scaling laws, because the laws derived from network measurements are shape independent for realistic basins (Fig. 35.12). The relationship between the number of NODs and the radius of the disk provides an alternative fractal-like dimension that is shape dependent. This was done with the aim of relating shape dependent power laws to geomorphic processes such as hill-slope processes and erosion.

Mathematical morphology transformations are applied to decompose fractal basins (e.g. Fig. 35.11a) into non-overlapping disks of various sizes (Fig. 35.11b), and further to derive fractal power-laws based on number-radius relationships.
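The decomposition into non-overlapping disks and the number-radius fit can be illustrated with a small sketch. It is not the published algorithm: the brute-force search for the largest inscribed disk below is swapped in for the cycles of morphological erosion, dilation and logical subtraction used in the cited studies, and all function names and the square test mask are hypothetical.

```python
import math

def disk_offsets(r):
    """Pixel offsets of a discrete disk of integer radius r (r = 0 is one pixel)."""
    return [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)
            if dy * dy + dx * dx <= r * r]

def largest_disk(free):
    """Brute-force search for the largest disk lying entirely inside `free`."""
    best_r, best_c = -1, None
    for (y, x) in sorted(free):
        r = 0
        while all((y + dy, x + dx) in free for dy, dx in disk_offsets(r + 1)):
            r += 1
        if r > best_r:
            best_r, best_c = r, (y, x)
    return best_r, best_c

def decompose_into_disks(mask):
    """Greedily peel non-overlapping disks off a binary mask, largest first."""
    free = {(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v}
    disks = []
    while free:
        r, (cy, cx) = largest_disk(free)
        disks.append((r, (cy, cx)))
        free -= {(cy + dy, cx + dx) for dy, dx in disk_offsets(r)}
    return disks

def count_below(disks, radius):
    """Number of disks whose radius is smaller than the given threshold."""
    return sum(1 for r, _ in disks if r < radius)

def loglog_slope(points):
    """Least-squares slope of log(count) against log(radius)."""
    xs = [math.log(r) for r, _ in points]
    ys = [math.log(n) for _, n in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

mask = [[1] * 7 for _ in range(7)]   # toy "non-network space": a 7 x 7 square
disks = decompose_into_disks(mask)
```

Feeding pairs `(radius, count_below(disks, radius))` for several positive radii into `loglog_slope` gives the slope of the number-radius plot; how that slope maps onto the reported fractal-like dimension is left to the cited papers.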

**Fig. 35.12 a** 5th order channel network *C* of Durian Tungal catchment basin; basin X is reconstructed from this channel network via multiscale morphological closing transformation, **b** M = X\C

### *35.3.4 Geodesic Spectrum*

We have provided a novel geomorphologic indicator by simulating geodesic flow fields (Fig. 35.13d–f) within basins (Fig. 35.13a–c) consisting of spatially distributed elevation regions (Lim and Sagar 2008a, b), further to compute a geodesic spectrum that provides a unique one-dimensional geometric support.

This one-dimensional geometric support, in other words the geodesic spectrum, outperforms the conventional width–function based approach, which is usually derived from planar forms of the basin and its networks. Its construction involves the basin as

**Fig. 35.13 a** a flat circular basin, **b** a basin with three spatially distributed elevation regions, **c** a fractal basin with channelised and non-channeled regions **d** flow fields with isotropic propagation in **a**, **e** isotropic flow fields within **b**, and **f** flow fields within **c** and orthogonality between the flow fields of channelized and non-channelized zones is obvious. (From Lim and Sagar 2008a, b.)

**Fig. 35.14** Basin 1 of Cameron Highlands is taken as an example to show the basin images at multiple scales generated via closing and opening. Basin 1 is located at the northern part of Cameron Highlands region, with a size of 3.1 km (east to west) 63.4 km (north to south). (Upper panel) DEM at multiple scales generated via opening, and (Lower panel) multiscale DEMs generated via closing

a random elevation field (e.g. a Digital Elevation Model, DEM) and all threshold elevation regions decomposed from the DEM, which allows the shape-function relationship to be understood much better than with the width function.

### *35.3.5 Granulometric and Anti-granulometric Analysis of Basin-DEMs*

Granulometric indexes derived for spatial elevation fields also yield scale invariant but shape-dependent measures (Tay et al. 2005a, b, c, 2007). DEMs are analyzed through granulometries via multiscale opening (Fig. 35.14 upper panel) and anti-granulometries via multiscale closing (Fig. 35.14 lower panel) to derive shape-size complexity measures of foreground and background, respectively. These measures provide new indices to understand terrestrial surfaces and to relate them to several geomorphic processes.
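A granulometry and an anti-granulometry of a small elevation field can be sketched as follows. The sketch assumes a flat 3 × 3 square structuring element applied n times for scale n, windows clipped at the image border, and the sum of elevations as the size measure; these choices, the function names, and the toy DEM are illustrative assumptions and not the exact settings of Tay et al.

```python
def _window_filter(f, pick):
    """Apply `pick` (min or max) over the 3 x 3 window clipped to the grid."""
    h, w = len(f), len(f[0])
    return [[pick(f[j][i]
                  for j in range(max(y - 1, 0), min(y + 2, h))
                  for i in range(max(x - 1, 0), min(x + 2, w)))
             for x in range(w)] for y in range(h)]

def erode(f, n=1):
    """Grayscale erosion by a flat 3 x 3 square, repeated n times."""
    for _ in range(n):
        f = _window_filter(f, min)
    return f

def dilate(f, n=1):
    """Grayscale dilation by a flat 3 x 3 square, repeated n times."""
    for _ in range(n):
        f = _window_filter(f, max)
    return f

def opening(f, n):
    return dilate(erode(f, n), n)

def closing(f, n):
    return erode(dilate(f, n), n)

def volume(f):
    """Sum of elevations, used here as the size measure of a field."""
    return sum(map(sum, f))

def granulometry(f, max_scale):
    """Volumes of successive openings: peaks are removed as scale grows."""
    return [volume(opening(f, n)) for n in range(max_scale + 1)]

def antigranulometry(f, max_scale):
    """Volumes of successive closings: pits are filled as scale grows."""
    return [volume(closing(f, n)) for n in range(max_scale + 1)]

# Toy DEM with a single central peak, purely illustrative.
dem = [
    [1, 1, 1, 1, 1],
    [1, 2, 2, 2, 1],
    [1, 2, 9, 2, 1],
    [1, 2, 2, 2, 1],
    [1, 1, 1, 1, 1],
]
spectrum = granulometry(dem, 2)
antispectrum = antigranulometry(dem, 1)
```

The opening volumes never increase with scale and the closing volumes never decrease, so normalized differences between successive entries can serve as size distributions of the foreground and background, respectively.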

### **35.4 Geomorphologic Modeling and Simulation**

Simulations allow us to gain a significantly better understanding of complex geomorphologic systems in a way that is not possible with lab experiments. Effectively attaining these goals presents many computational challenges, which include the development of frameworks. The robustness of mathematical morphological operators combined with concepts of fractal geometry (Mandelbrot 1982) in modeling and simulations of certain geoscientific phenomena and processes is shown briefly with illustrative examples in this section. The phenomena and processes given emphasis in this section include geomorphologic features, basins and channel networks, landscapes, water bodies, symmetrical folds and ideal sand dunes. Besides providing approaches to simulate a fractal-skeletal based channel network model and fractal landscapes, we have shown via discrete simulations the varied dynamical behavioral phases of certain geoscientific processes (e.g. water bodies, ductile symmetric folds, sand dunes, landscapes) under nonlinear perturbations due to forces of *endogenic* and *exogenic* nature. For these simulations we employed nonlinear first order difference equations, bifurcation theory, fractal geometry, and nonlinear morphological transformations as the bases. The three complex systems that we focus on are the channelization process, surface water bodies, and elevation structures.

### *35.4.1 Geomorphologic Modeling: Concept of Discrete Force*

The concept of discrete force was proposed from a theoretical standpoint to model certain geomorphic phenomena. Geomorphologically realistic expansions and contractions, and cascades of these two transformations, were introduced, and five laws of geomorphologic structures were proposed (Sagar et al. 1998a, b). This work also proposed and explained the possibility of deriving a discrete rule from a geomorphic feature (e.g. a lake) undergoing morphological changes that can be retrieved from temporal satellite data (Fig. 35.15). Laws of geomorphic structures under perturbations are provided, and the interplay between numerical simulations and graphic analysis shows how systems traverse various behavioral phases.

**Fig. 35.15 a** Hypothetical geomorphic feature at time t, **b** geomorphic feature at time *t* + 1, and **c** difference in geomorphic feature from time *t* to *t* + 1

### *35.4.2 Fractal-Skeletal Based Channel Network Model*

Our work on channel network modelling (Gastner and Newman 2004; Sagar 2001c) represents a unique contribution to the literature, which until recently was dominated by the classic random model. We developed the Fractal-Skeletal Channel Network (F-SCN) model, following certain postulates, by employing morphological skeletonization to construct other classes of network models, which can exhibit various empirical features that the random model cannot. In the F-SCN model, which gives rise to Horton laws, the generating mechanism plays an important role. Homogeneous and heterogeneous channel networks can be constructed by a symmetric generator with non-random rules, and by symmetric or asymmetric generators with random rules. Subsequently, F-SCNs (Fig. 35.16d–f) in different shapes of fractal basins (Fig. 35.16a–c) are generated and their generalized Hortonian laws (Fig. 35.16g, h) are computed; these are found to be in good accord with other established network models, such as Optimal Channel Networks (OCNs), and with realistic rivers. The F-SCN model is extended to generate more realistic dendritic branched networks.

### *35.4.3 Fractal Landscape via Morphological Decomposition*

By applying morphological transformations, fractals of varied types are decomposed into topologically prominent regions (TPRs) (Fig. 35.17a); each TPR is coded, and a geomorphologically realistic fractal landscape organization is simulated (Fig. 35.17b) (Sagar and Murthy 2000).

**Fig. 35.16 a**, **b** and **c** Fractal basins after respective iterations. **d**, **e** and **f** An evolutionary sequence of F-SCNs after respective iterations, **g** Horton's law of number, and **h** Horton's law of mean length

**Fig. 35.17 a** A binary fractal basin after decomposition into TPRs **b** A fractal landscape generated from Fig. 35.17a. Light and dark regions of DEM are visualized as high and low elevations (vertical exaggeration: 7)

### *35.4.4 Discrete Simulations and Modeling the Dynamics of Small Water Bodies, Symmetrical Folds, and Sand Dunes*

In this subsection we show the fusion of computer simulations and modeling techniques to better understand certain terrestrial phenomena and processes. The ultimate goal is to develop cogent models in discrete space and thereby to gain a significantly better understanding of complex terrestrial systems in a way that is not possible with lab experiments. The three synthetic phenomena considered, which are explained by generating attractors, are water bodies (Sagar and Rao 1995a, b, c), symmetrical folds (Sagar 1998), and sand dunes (Sagar 1999b, 2000a, b, 2001a, 2005a, b; Sagar and Venu 2001; Sagar et al. 2003a, b).

### **35.4.4.1 Discrete Simulations and Modeling the Dynamics of Small Water Bodies**

Spatio-temporal patterns of small water bodies (SWBs) under the influence of temporally varied streamflow discharge behaviors are simulated in discrete space by employing geomorphologically realistic expansion and contraction transformations (Fig. 35.18). Expansions and contractions of SWBs to various degrees (e.g. Fig. 35.18B g–l), which are obvious due to fluctuations in streamflow discharge pattern (Fig. 35.18A, a–f), simulate the effects respectively owing to streamflow discharge that is greater or less than mean streamflow discharge. The cascades of expansion-contraction are systematically performed by synchronizing the streamflow discharge (Fig. 35.18A, a–f), which is represented as a template with definite

**Fig. 35.18 A** Streamflow discharge behavioral pattern at different environmental parameters. **a**–**f** λ = 1, 2, 3, 3.46, 3.57 and 3.99, and **B** Spatio-temporal organization of the surface water bodies under the influence of various streamflow discharge behavioral patterns at the environmental parameters at **a**–**f** λ = 1, 2, 3, 3.46, 3.57, and 3.99 are shown up to 20 time steps. In all the cases, the initial MSD, A<sub>0</sub> = 0.5 (in normalized scale), is considered under the assumption that the water bodies attain their full capacity. Only the overlaid outlines of water bodies at respective time-steps are illustrated for the various λs

characteristic information, as the basis to model the spatio-temporal organization of randomly situated surface water bodies of various sizes and shapes.

We have shown the varied dynamical behavioral phases of certain geoscientific processes (e.g. water bodies) under nonlinear perturbations via the discrete simulations.
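A minimal sketch of this kind of simulation couples the logistic first order difference equation, A(t+1) = λ A(t) (1 − A(t)), with morphological expansion (dilation) and contraction (erosion) of a water body. The one-pixel step per time step, the 3 × 3 neighbourhood, the use of the series mean as the reference discharge, and all names below are assumptions made for illustration; the degree of expansion or contraction used in the original simulations is not specified here.

```python
def logistic_series(lam, a0=0.5, steps=20):
    """Iterate A(t+1) = lam * A(t) * (1 - A(t)) starting from a0."""
    series = [a0]
    for _ in range(steps):
        series.append(lam * series[-1] * (1.0 - series[-1]))
    return series

NEIGHBOURS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def expand(body):
    """Binary dilation of a pixel set by a 3 x 3 square."""
    return {(y + dy, x + dx) for (y, x) in body for dy, dx in NEIGHBOURS}

def contract(body):
    """Binary erosion of a pixel set by a 3 x 3 square."""
    return {(y, x) for (y, x) in body
            if all((y + dy, x + dx) in body for dy, dx in NEIGHBOURS)}

def simulate(body, lam, a0=0.5, steps=20):
    """Expand when discharge exceeds the reference, contract when below it."""
    series = logistic_series(lam, a0, steps)
    reference = sum(series) / len(series)   # assumed reference discharge
    states = [set(body)]
    for a in series[1:]:
        if a > reference:
            states.append(expand(states[-1]))
        elif a < reference:
            states.append(contract(states[-1]))
        else:
            states.append(set(states[-1]))
    return series, states

# Toy water body: a 3 x 3 block of pixels.
pond = {(y, x) for y in range(3) for x in range(3)}
steady_series, steady_states = simulate(pond, lam=2.0, steps=5)
chaotic_series, chaotic_states = simulate(pond, lam=3.99, steps=20)
```

With λ = 2 and A0 = 0.5 the discharge stays at its fixed point and the outline does not change, whereas larger values of λ produce periodic or irregular sequences of expansions and contractions.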

#### **35.4.4.2 Ductile Symmetrical Fold Dynamics**

Under various possible time-dependent and time-independent strengths of the control parameter, in other words nonlinear perturbations, the three-limb symmetrical folds are transformed in a time sequential mode to simulate various possible fold dynamical behaviors (Fig. 35.19a, b), synchronized with the trajectory behavior simulated via the logistic equation with strength of nonlinearity parameters 3.9 and 2.8 (Fig. 35.20a, b). We employed normalized fractal dimension values and interlimb angles (IAs) as parameters, along with the strength of nonlinearity parameters, in this study. Bifurcation

**Fig. 35.19** Evolution of a fold type with the strength of nonlinearities: **a** λ = 3.9 and **b** λ = 2.8. The numbers represent the discrete times. (From Sagar 1998)

diagrams are constructed for both time-dependent and time-independent fold dynamical behaviors, and the equations to compute metric universality by considering the interlimb angles computed at threshold strengths of nonlinearity parameters are proposed (Sagar 1998).

#### **35.4.4.3 Symmetrical Sand Dune Dynamics**

Certain possible morphological behaviors, with respective critical states represented by inter-slip face angles of a sand dune under the influence of non-systematic processes, are qualitatively illustrated on the basis of the first order difference equation that has physical relevance to modeling the morphological dynamics of sand dune evolution. It is deduced that the critical state of a sand dune under dynamics depends on the regulatory parameter, which encompasses exodynamic processes of random nature, and on the morphological configuration of the sand dune. With the aid of the regulatory parameter and the specification of the initial state of the sand dune, the morphological history of the sand dune evolution can be investigated. As an attempt to furnish the interplay between numerical experiments and the theory of morphological evolution, the process of dynamical changes (Fig. 35.21) in the sand dune with a change in the threshold regulatory parameter (e.g. Fig. 35.22) is modeled qualitatively for a better understanding. An equation to compute metric universality by considering attracting inter-slip face angles is also proposed. The avalanche size distribution in such numerically simulated sand dune dynamics has also been studied.

**Fig. 35.20** Logistic maps for the qualitative dynamical behavior of symmetric folds under evolution shown in Fig. 35.19a, b. It may be seen that the values mentioned on the abscissa are IAs in degrees for the symmetric fold with three limbs. (From Sagar 1998)

**Fig. 35.21 a** Initial sand dune profile with α = 0.00001 or θ = 179.57334. The attractor sand dune profiles at various threshold regulatory parameters: **b** λ = 3, fixed point attractor sand dune; **c** λ = 3.46, period 2 attractor sand dunes; **d** λ = 3.569, period 4 attractor sand dunes; and **e** λ = 3.57, period 8 attractor sand dunes. The attractor sand dune profiles shown in **b**–**e** are obtained by iterating 3 × 10<sup>4</sup> time steps. (From Sagar 1999a, b)

### **35.5 Geospatial Computing and Visualization**

Mathematical morphology not only provides robust solutions in terrestrial pattern retrieval, analysis, and modeling and simulations but also provides numerous insights worth exploring to find solutions for the challenges encountered in GISci. In recent works—that include (i) binary and grayscale morphological interpolations,

**Fig. 35.22 a** A 1-D map of θ<sub>t+1</sub> versus θ<sub>t</sub> for the sand dune case with λ = 4 and **b** return map of θ<sub>t+1</sub> − θ<sub>t</sub> versus θ<sub>t+2</sub> − θ<sub>t+1</sub> for the sand dune case with λ = 4. (From Sagar et al. 2003a, b; Sagar and Venu 2001)

SKIZ, WSKIZ and applications in spatiotemporal visualizations, conversion of point-specific variable data into contiguous zonal maps (Rajashekara et al. 2012), morphing (Sagar and Lim 2015a, b) and variable-specific cartogram generation (Sagar 2014a, b), (ii) volumetric visualization of topologically significant components such as pore-bodies, pore-throats, and pore-channels (Teo and Sagar 2005, 2006), and (iii) spatial reasoning, planning, and interactions (Sagar et al. 2013; Vardhan et al. 2013; Sagar 2018)—one can see how robust approaches could be developed by considering mathematical morphological transformations.

### *35.5.1 Morphological Interpolations*

This subsection presents the applications of binary and grayscale morphological interpolations in hierarchical computation of morphological medians and in morphing, and the applications of SKIZ and WSKIZ in conversion of point-specific variable data into contiguous zonal maps and in generation of variable-specific contiguous cartograms.
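The idea of turning point-specific data into contiguous zones can be sketched with a brute-force nearest-seed labelling, where the boundaries between zones play the role of the skeleton by influence zones (SKIZ). The morphological construction by successive dilations is replaced here by a direct Euclidean distance computation, and the weighting rule (distance divided by a per-seed weight, so that stronger seeds claim larger zones) is only an assumed stand-in for WSKIZ; all names are hypothetical.

```python
import math

def influence_zones(height, width, seeds, weights=None):
    """Label each pixel with the seed whose (weighted) distance is smallest.

    seeds   : dict mapping a label to a list of (row, col) seed pixels
    weights : optional dict mapping a label to a positive weight
    Ties go to the smallest label.
    """
    weights = weights or {label: 1.0 for label in seeds}
    labels = sorted(seeds)

    def cost(pixel, label):
        nearest = min(math.dist(pixel, q) for q in seeds[label])
        return nearest / weights[label]

    return [[min(labels, key=lambda lab: cost((y, x), lab))
             for x in range(width)] for y in range(height)]

def zone_boundaries(zones):
    """Pixels whose right or lower neighbour belongs to a different zone."""
    h, w = len(zones), len(zones[0])
    return {(y, x) for y in range(h) for x in range(w)
            if (x + 1 < w and zones[y][x + 1] != zones[y][x])
            or (y + 1 < h and zones[y + 1][x] != zones[y][x])}

# Two point seeds on a 1 x 5 strip, purely illustrative.
seeds = {1: [(0, 0)], 2: [(0, 4)]}
plain = influence_zones(1, 5, seeds)
weighted = influence_zones(1, 5, seeds, weights={1: 1.0, 2: 2.0})
```

In the unweighted case the strip is split near the middle, while doubling the weight of seed 2 moves the boundary towards seed 1, which is the qualitative behaviour expected of a variable-weighted zonal map.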

### **35.5.1.1 Computation of Hierarchical Morphological Medians**

Hausdorff-distance based (i) spatial relationships between maps possessing bijection for categorization and (ii) nonlinear spatial interpolation for visualization of spatiotemporal behavior are proposed and demonstrated. This work (Sagar 2010, 2014a, b; Challa et al. 2016) concerns the development of frameworks with the goal of understanding spatial and/or temporal behaviors of certain evolving and dynamic geomorphic phenomena. In Sagar (2010), we have shown (i) how Hausdorff-Dilation and Hausdorff-Erosion metrics could be employed to categorize time-varying spatial phenomena, and (ii) how thematic maps in time-sequential mode (Fig. 35.23a) can be used to visualize the spatiotemporal behaviour of a phenomenon by recursive generation of median elements (Fig. 35.23b). Spatial interpolation, which was earlier seen as a global transform, is extended in Lim and Sagar (2008) by introducing *bijection* to deal with even connected components. This aspect solves problems of global nature in spatial-temporal GIS. The spatial interpolation technique is found useful for spatial-temporal GIS and is demonstrated, with validation, on epidemic spread maps collected for eleven years between 1896 and 1906 (Fig. 35.23a–k, upper left panel). Morphological medians are computed between the epidemic spread maps staggered at two-year intervals (Fig. 35.23a–k, upper right panel). Further morphological medians are computed in a hierarchical manner between every two epidemic spread maps of successive years (Fig. 35.23a, b in the lower panel).
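The median element between two binary maps can be sketched with the commonly used median-set formula, M(X, Y) = ∪<sub>λ≥0</sub> [δ<sub>λ</sub>(X ∩ Y) ∩ ε<sub>λ</sub>(X ∪ Y)], where δ<sub>λ</sub> and ε<sub>λ</sub> denote dilation and erosion of size λ. The sketch below assumes a 3 × 3 square structuring element, maps represented as sets of pixel coordinates, and a non-empty intersection between the two maps; the function names and the square test maps are illustrative.

```python
NEIGHBOURS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def dilate(s):
    """Binary dilation of a pixel set by a 3 x 3 square."""
    return {(y + dy, x + dx) for (y, x) in s for dy, dx in NEIGHBOURS}

def erode(s):
    """Binary erosion of a pixel set by a 3 x 3 square."""
    return {(y, x) for (y, x) in s
            if all((y + dy, x + dx) in s for dy, dx in NEIGHBOURS)}

def median_set(x, y):
    """Union over scales of (dilated intersection) & (eroded union)."""
    inner, outer = x & y, x | y
    if not inner:
        raise ValueError("the two maps must overlap for this formula")
    median = set()
    while outer:
        median |= inner & outer
        inner, outer = dilate(inner), erode(outer)
    return median

def block(lo, hi):
    """Square block of pixels with rows and columns from lo to hi inclusive."""
    return {(r, c) for r in range(lo, hi + 1) for c in range(lo, hi + 1)}

small, large = block(3, 7), block(1, 9)     # 5 x 5 inside 9 x 9
halfway = median_set(small, large)          # expected to be the 7 x 7 block
```

Applying `median_set` again between an original map and a previously computed median yields the next level of a hierarchical sequence of interpolated maps, in the spirit of the recursive generation described above.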

### **35.5.1.2 Grayscale Morphological Interpolation and Morphing**

The computation of morphological medians between thematic maps (binary images), demonstrated in the previous subsection, can be extended to spatial fields (functions, e.g. DEMs). This extended version is termed grayscale morphological interpolation. We have demonstrated the application of grayscale morphological interpolations, computed hierarchically between spatial fields (Fig. 35.24), to metamorphose a source-spatial field into a target-spatial field. Grayscale morphological interpolations are computed in a hierarchical manner (Fig. 35.25) with respect to a non-flat structuring element. We found that the morphing of the source-spatial field into the target-spatial field created with a non-flat structuring element is more appropriate, because the transition across discrete time steps is smoother than that of the morphing obtained with a flat structuring element (Sagar and Lim 2015a, b). This morphing, shown via nonlinear grayscale morphological interpolations, is of considerable value in geographical information science, and in particular in spatiotemporal geo-visualization.

**Fig. 35.23** (Upper-left panel) **a**–**k** Spatiotemporal maps that represent the geographic spread of bubonic plague in India between 1896 and 1906 at intervals of one year (Maragos and Schafer 1986). The 11 spatial maps depicting the spread of plague were used sequentially to generate the maximum possible number of interpolated maps; (Upper-right panel) **a** Original spatial map of the bubonic plague during 1896. **b**–**j** The first-level median sets computed as M(X<sup>t</sup>, X<sup>t+2</sup>) for all *t* ranging from 1896 to 1905. **k** Original spatial map during 1906. For validation, the first-level median sets M(X<sup>t</sup>, X<sup>t+2</sup>) in **b**–**j** of the upper-right panel are compared, for all *t*, with the actual maps in Fig. 35.23b–j of the upper-left panel. These first-level median sets show a reasonable match with the actual sets (Fig. 35.23b–j of the upper-left panel); (Lower panel) Superimposed gray-coded **a** original spatial maps and **b** spatial maps generated via median set computations

**Fig. 35.24** Smaller regions of DEMs: **a** Cameron Highlands, and **b** Petaling region

**Fig. 35.25** Morphological medians generated with a non-flat structuring element between the DEMs shown in (**a**) and (**i**), at **b** zeroth level, **c**, **d** first level, and **e**–**h** second level
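To illustrate why the choice between a flat and a non-flat structuring element matters, the following sketch computes a grayscale median on a 1-D profile. It assumes the median between two functions is the supremum, over n, of the pointwise minimum of the lower envelope dilated n times and the upper envelope eroded n times. The symmetric three-sample structuring functions, the `max_steps` cut-off, and all function names are assumptions made for illustration, not the implementation used in Sagar and Lim (2015a, b).

```python
def gray_dilate(f, b):
    """Grayscale dilation of profile f by a symmetric structuring function b."""
    r, n = len(b) // 2, len(f)
    return [max(f[x + h] + b[h + r] for h in range(-r, r + 1) if 0 <= x + h < n)
            for x in range(n)]

def gray_erode(f, b):
    """Grayscale erosion of profile f by a symmetric structuring function b."""
    r, n = len(b) // 2, len(f)
    return [min(f[x + h] - b[h + r] for h in range(-r, r + 1) if 0 <= x + h < n)
            for x in range(n)]

def grayscale_median(f, g, b, max_steps=100):
    """Median profile between f and g with respect to structuring function b."""
    lo = [min(p, q) for p, q in zip(f, g)]   # lower envelope of f and g
    hi = [max(p, q) for p, q in zip(f, g)]   # upper envelope of f and g
    median = list(lo)
    for _ in range(max_steps):
        lo, hi = gray_dilate(lo, b), gray_erode(hi, b)
        median = [max(m, min(p, q)) for m, p, q in zip(median, lo, hi)]
        if all(p >= q for p, q in zip(lo, hi)):
            break  # further steps can no longer raise the median
    return median

source = [0, 0, 0, 0, 0]   # flat terrain at elevation 0
target = [4, 4, 4, 4, 4]   # flat terrain at elevation 4

flat = [0, 0, 0]       # flat structuring element: acts only horizontally
non_flat = [0, 1, 0]   # non-flat: also raises or lowers elevation by 1 per step

print(grayscale_median(source, target, flat))      # [0, 0, 0, 0, 0]
print(grayscale_median(source, target, non_flat))  # [2, 2, 2, 2, 2]
```

With the flat element the dilations never change the elevation of a constant profile, so the median stays at the source level; with the non-flat element the median lands halfway between source and target. This toy case is consistent with the smoother transition reported above for non-flat structuring elements, although real DEMs are two-dimensional and far less regular.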

### **35.5.1.3 Point-to-Polygon Conversion via WSKIZ**

Data on many variables are available as numerical values at specific geographical locations, in a noncontiguous form. We developed a methodology based on mathematical morphology to convert such point-specific data into polygonal data. This methodology relies on weighted skeletonization by zone of influence (WSKIZ). WSKIZ determines the points of contact of multiple frontlines that propagate from various points (e.g. gauge stations) spread over the space, at travelling rates that depend on the variable's strength. We demonstrate this approach by converting rainfall data available at specific rain gauge locations (points) (Fig. 35.26a) into a polygonal map (Fig. 35.26b) that shows spatially distributed zones of equal rainfall in a contiguous form (Rajashekara et al. 2012).
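The idea of frontlines travelling at variable-dependent rates can be sketched as follows. This is a simplification, not the WSKIZ algorithm of Rajashekara et al. (2012): instead of iterating dilations, each cell is assigned to the seed whose front would arrive first, taking the arrival time as the chessboard distance (the distance induced by a 3×3 square structuring element) divided by the seed's propagation speed. The function name, the grid layout, and the tie-breaking rule are assumptions.

```python
def weighted_zones(shape, seeds):
    """Assign every grid cell to the seed whose front reaches it first.

    shape: (rows, cols) of the grid.
    seeds: dict mapping a label to ((row, col), speed), speed > 0.
    Ties go to the label that sorts first (an arbitrary choice).
    """
    rows, cols = shape
    zones = {}
    for r in range(rows):
        for c in range(cols):
            def arrival(label):
                (sr, sc), speed = seeds[label]
                # chessboard distance travelled at the seed's own speed
                return max(abs(r - sr), abs(c - sc)) / speed
            zones[(r, c)] = min(sorted(seeds), key=arrival)
    return zones

# Two hypothetical gauge stations; A2 propagates twice as fast as A1.
seeds = {"A1": ((0, 0), 1), "A2": ((0, 9), 2)}
zones = weighted_zones((1, 10), seeds)
print("".join("1" if zones[(0, c)] == "A1" else "2" for c in range(10)))
# 1111222222
```

The faster front claims the larger zone, and the cells where neighbouring labels change mark the contact line between the two fronts. A faithful WSKIZ would propagate the fronts by repeated dilation so that a front cannot pass through a zone already claimed by another.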

### **35.5.1.4 Cartograms via WSKIZ**

Geographic variables can be visualized as spatial objects whose size is proportional to the variable's strength by generating variable-specific cartograms. We developed a methodology based on mathematical morphology to generate contiguous cartograms. This approach determines the points of contact of multiple frontlines that propagate from the centroids of various planar sets (states), at travelling rates that depend on the variable's strength (Fig. 35.27a–d).

The contiguous cartogram generated via this algorithm preserves the global shape and the local shapes, and yields minimal area-errors. It is inferred from the comparative error analysis that this approach could be further extended by exploring the applicability of additional characteristics of the structuring element, which controls the propagation speed and direction of dilation while generating variable-specific cartograms, to minimize the local shape errors and area-errors. This algorithm addresses a decade-long problem of preserving the global and local shapes of cartograms. To demonstrate the proposed approach, a cartogram was generated for the variable population. Further, the population cartograms for the USA generated via four other approaches (Kocmoud 1997; Keim et al. 2004; Gastner and Newman 2004; Gusein-Zade and Tikunov 1993) are compared with the morphology-based cartogram (Fig. 35.28a–f) in terms of errors with respect to area, local shape, and global shape. This approach for generating cartograms preserves the global shape at the expense of some compromise on area-errors. It is inferred from the comparative error analysis that the proposed morphology-based approach could be further extended by exploring the applicability of additional characteristics of the probing rule, which controls the propagation speed and direction of dilation while performing WSKIZ, to minimize the local shape errors and area-errors.

**Fig. 35.26 a** 34 points (locations) of rain-gauge stations spread over India, indexed A1–A34, **b** Rainfall zonal map generated by having various possible propagation speeds, and the variable strengths in terms of propagation speeds

**Fig. 35.27** The variable strengths (in terms of propagation speeds) are given as **a** *A*<sub>2</sub> > *A*<sub>4</sub> > *A*<sub>1</sub> > *A*<sub>3</sub>, **b** *A*<sub>2</sub> > *A*<sub>1</sub> > *A*<sub>3</sub> > *A*<sub>4</sub>, **c** *A*<sub>1</sub> > *A*<sub>3</sub> > *A*<sub>2</sub> > *A*<sub>4</sub>, and **d** *A*<sub>1</sub> > *A*<sub>4</sub> > *A*<sub>2</sub> > *A*<sub>3</sub>

### *35.5.2 Visualization of Topological Components in a Volumetric Space*

A heterogeneous material is one composed of domains of different materials (phases). The aim of this module is to show how geometric descriptors derived via mathematical morphology and fractal analysis vary between the porous phases isolated from varied types of rocks at various spatial and spectral scales. Recent work on Fontainebleau sandstone shows that the characteristics derived through computer-assisted mapping and computed tomographic analysis correlate well with physical properties such as porosity, permeability, and conductance. Whatever the physical processes involved in altering the porous phase of a material, we propose to emphasise quantifying the complexity of the porous phase in both 2-D and 3-D domains. From the perspective of petrologic study, such quantitative characterization in both two- and three-dimensional spaces is of current interest.

**Fig. 35.28 a** Equal-area-projection map of USA. **b**–**e** Population cartograms generated for USA based on **b** Continuous cartogram (Kocmoud 1997), **c** cartodraw (Keim et al. 2004), **d** Gastner-Newman cartogram (Gastner and Newman 2004), **e** Area cartogram of the United States, with each county rescaled in proportion to its population (Gusein-Zade and Tikunov 1993), and **f** morphology-based cartogram (Sagar 2014a, b). U.S. population cartogram by Gusein-Zade and Tikunov (e: Reproduced with permission from Gusein-Zade and Tikunov 1993, page 172, Fig. 35.1, © 1993 American Congress on Surveying and Mapping). The color coding given in Fig. 35.28a is similar to that of Fig. 35.28f

Just as CT scanning is employed to scan the brain in order to study several neurophysiologic processes, one can also employ CT scanning, besides existing scanning methods, to scan rock bodies and store the scanned information in layered form. Each layer depicts the rock's cross-sectional information at a specific depth. An important task is the retrieval, in both 2-D and 3-D spaces, of three significant geometric and/or topologic components that describe the organisation of a porous medium: (a) pore channel, (b) pore throat, and (c) pore body. A 3-D fractal pore (Fig. 35.29a, b), simulated in such a way that it mimics stacked layers of pore sections, is converted into a 3-D pore channel network (Fig. 35.29c, d), 3-D pore throats (Fig. 35.29e, f) and 3-D pore bodies (Fig. 35.29g, h). These decomposed pore features, which are of topological importance, help in deriving geometric relations that can in turn be related to the physical properties of the porous structure.

**Fig. 35.29** Top and side views of **a**, **b** model 3D fractal binary pore, **c**, **d** pore-channel, **e**, **f** pore-throat, and **g**, **h** pore-body. (*Source* Teo and Sagar 2006)

### *35.5.3 Spatial Reasoning and Planning*

The mathematical-morphology-based algorithms developed and demonstrated in this subsection determine (i) strategically significant set(s) for spatial reasoning and planning, (ii) directional spatial relationships between areal objects (e.g. lakes, states, sets) via origin-specific dilations, and (iii) spatial interactions via a modified gravity model.

### **35.5.3.1 Strategically Significant State(s)**

Identification of a strategically significant set from a cluster of adjacent and/or non-adjacent sets depends upon parameters that include size, shape, degrees of adjacency and contextuality, and distance between the sets. Examples of clusters of sets include continents, countries, states, and cities. The spatial relationships between such sets, deciphered via the parameters cited above, possess varied spatial complexities. The Hausdorff dilation distance between such sets is used to derive automatically the strategic set among the cluster of sets. The (i) dilation distances, (ii) length of boundary being shared, and (iii) degrees of contextuality and adjacency between the origin-set and destination sets together provide solutions for deriving strategically significant sets with respect to distance, degree of contextuality, degree of adjacency and length of boundary being shared. Simple mathematical morphologic operators and certain logical operations are employed in this study. The results obtained by applying the proposed framework to a case study involving spatial sets (states) decomposed from a spatial map of India are shown in Fig. 35.30.

This approach has been applied to data depicting randomly spread surface water bodies (Fig. 35.31a, b) and their corresponding zones of influence (Fig. 35.31c, d) within a subbasin to detect the strategically significant water body and zone of influence (Fig. 35.32a, b).

#### **35.5.3.2 Directional Spatial Relationship**

We provide an approach to compute origin-specific morphological dilation distances between planar sets (e.g. areal objects, spatially represented countries, states, cities, lakes) in order to determine the directional spatial relationship between sets. The origin of the structuring element that yields a shorter dilation distance than the other possible origins determines the directional spatial relationship between *A*<sub>i</sub> (origin-set) and *A*<sub>j</sub> (destination set). We demonstrate this approach on a cluster of spatial sets (states) decomposed from a spatial map of India (Fig. 35.33a). This approach can potentially be extended to any number (type) of sets in Euclidean space.
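A minimal sketch of the idea follows, under stated assumptions. The dilation distance from *A*<sub>i</sub> to *A*<sub>j</sub> is taken here as the smallest number of dilations after which the grown *A*<sub>i</sub> covers *A*<sub>j</sub>. Moving the origin of a 3×3 structuring element off-centre is modelled by shifting all of its offsets, which makes the dilation grow preferentially in one compass direction. The eight direction labels, the `max_steps` cut-off, and the function names are illustrative and are not taken from the original work.

```python
from itertools import product

# Shift applied to the 3x3 element for each growth direction (row, col);
# rows increase southwards, columns increase eastwards.
DIRECTIONS = {"N": (-1, 0), "NE": (-1, 1), "E": (0, 1), "SE": (1, 1),
              "S": (1, 0), "SW": (1, -1), "W": (0, -1), "NW": (-1, -1)}

def shifted_element(shift):
    """Offsets of a 3x3 element whose origin is displaced opposite to shift."""
    sr, sc = shift
    return [(dr + sr, dc + sc) for dr, dc in product((-1, 0, 1), repeat=2)]

def dilation_distance(A_i, A_j, element, max_steps=20):
    """Smallest n such that A_i dilated n times covers A_j, else None."""
    grown = set(A_i)
    for n in range(max_steps + 1):
        if A_j <= grown:
            return n
        grown = {(r + dr, c + dc) for (r, c) in grown for (dr, dc) in element}
    return None

def direction_of(A_i, A_j, max_steps=20):
    """Direction whose origin-specific dilation reaches A_j soonest."""
    distances = {name: dilation_distance(A_i, A_j, shifted_element(s), max_steps)
                 for name, s in DIRECTIONS.items()}
    reachable = {k: v for k, v in distances.items() if v is not None}
    best = min(reachable, key=reachable.get) if reachable else None
    return best, distances

A_i = {(6, 2)}   # origin set: a single pixel
A_j = {(0, 8)}   # destination set: six rows up and six columns to the right
best, distances = direction_of(A_i, A_j)
print(best, distances["NE"], distances["E"], distances["SW"])  # NE 3 6 None
```

The north-east growing element covers the destination in three dilations, the east and north growing elements need six, and elements growing away from the destination never reach it, so the destination set is classified as lying to the north-east of the origin set.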

**Fig. 35.30 A** Map of India (spatial system) with its 28 constituent states (subsets), indexed in alphabetical order: Andhra Pradesh (A1), Arunachal Pradesh (A2), Assam (A3), Bihar (A4), Chhattisgarh (A5), Goa (A6), Gujarat (A7), Haryana (A8), Himachal Pradesh (A9), Jammu & Kashmir (A10), Jharkhand (A11), Karnataka (A12), Kerala (A13), Madhya Pradesh (A14), Maharashtra (A15), Manipur (A16), Meghalaya (A17), Mizoram (A18), Nagaland (A19), Orissa (A20), Punjab (A21), Rajasthan (A22), Sikkim (A23), Tamilnadu (A24), Tripura (A25), Uttar Pradesh (A26), Uttarakhand (A27), West Bengal (A28). Union territories and the Himalayan hill range that are parts of the Indian peninsula are not included in the figure. **B** Spatial representation of strategically important states, in order from 1 to 10, in terms of twelve different parameters shown in Fig. 35.7. In each panel of this figure, the first 10 strategically significant states (please refer to the legend on each panel) are shown in different colors. These strategically significant sets are shown with respect to **a** boundary being shared, **b** shortest distance from origin to destination states, **c** shortest total distance from destination states to origin state, **d** contextuality, **e** Hausdorff dilation distance, **f** spatial complexity involved in length of the boundary being shared, **g** spatial complexity in terms of contextuality, **h** spatial complexity in terms of distance from origin to destination states, **i** spatial complexity in terms of distance from destination states to origin state, **j** spatial complexity in terms of Hausdorff dilation distance from origin state to destination states. Color-coded states denote the first ten strategically significant states, and the white regions represent the states that are strategically non-significant, with ranks from eleven to twenty-eight

#### **35.5.3.3 Spatial Interactions**

Hierarchical structures include a spatial system (e.g. a river basin), clusters of a spatial system (e.g. watersheds of a river basin), zones of a cluster (e.g. subwatersheds of a watershed), and so on. Variable-specific classification of the zones of a cluster within a spatial system is the main focus of this work on spatial interactions. Variable-specific (e.g. resources) classification of zones is done by computing the levels of interaction between the *i*th and *j*th zones. Based on a heuristic argument, we proposed a modified gravity model for computing the levels of interaction between the zones. This argument rests on the following two facts: (i) the level of interaction between the *i*th and *j*th zones, with masses *m*<sub>i</sub> and *m*<sub>j</sub>, is direction-dependent, and (ii) the level of interaction between the *i*th and *j*th zones situated at strategically insignificant locations would be much lower than that between zones with similar masses situated at strategically highly significant locations. With the support of this argument, we provide a modified gravity model that incorporates the asymmetrical distances and the product of the location significance indexes of the corresponding zones. This modified gravity model yields a level of interaction between two zones that satisfies the realistic characteristic that the interaction between zones is direction-dependent.

**Fig. 35.31 a** Indian Remote Sensing satellite (IRS LISS-III) multispectral image of the study area, and the blue objects are water bodies traced from IRS LISS-III image with topographic map reference superposed on IRS LISS-III image, and white dots indicate the boundary of the considered cluster, **b** small water bodies, **c** zones of influence of corresponding water bodies, and **d** water bodies and zones of influence with labeling

**Fig. 35.32** Spatially significant **a** water body with label 35 (Red Color), and **b** zone of water body influence labeled with 35 (Red Color)

**Fig. 35.33 a** Twenty-nine sets (states of India), indexed according to alphabetical order, are shown: Gujarat (*A*1), Rajasthan (*A*2), Maharashtra (*A*3), Goa (*A*4), Karnataka (*A*5), Kerala (*A*6), Madhya Pradesh (*A*7), Jammu and Kashmir (*A*8), Punjab (*A*9), Haryana (*A*10), Tamilnadu (*A*11), Andhra Pradesh (*A*12), Himachal Pradesh (*A*13), Delhi (*A*14), Uttar Pradesh (*A*15), Uttaranchal (*A*16), Chhattisgarh (*A*17), Orissa (*A*18), Bihar (*A*19), Jharkhand (*A*20), West Bengal (*A*21), Sikkim (*A*22), Assam (*A*23), Meghalaya (*A*24), Tripura (*A*25), Arunachal Pradesh (*A*26), Mizoram (*A*27), Manipur (*A*28), Nagaland (*A*29). Union Territories are not considered. **b** Directional spatial relationship shown in colored matrix form, with 29 rows and 29 columns and a color in each grid cell indicating the directional relationship between each state and the other 28 states

Each state of India is assigned ranks in terms of (i) its location significance index, (ii) the strengths of interaction of all states with that specific state, (iii) its strengths of interaction with other states, and (iv) the larger of (ii) and (iii) (Fig. 35.34a–d). Further, by employing the modified gravity model, the 28 states (X1 to X28) of India (Fig. 35.30A) are paired, from best interacting to least interacting pairs, with the areal extents of the states as the variable (Fig. 35.35a–j).
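The exact functional form of the modified gravity model is not given in this section, so the sketch below is only one plausible reading of the description. It assumes the classical gravity form (product of masses over squared distance), multiplied by the product of the two location significance indexes, with a distance table that is allowed to be asymmetric. The exponent of 2, the function names, and all numbers are hypothetical.

```python
def interaction(mass, phi, dist, i, j, exponent=2):
    """Assumed level of interaction directed from zone i to zone j.

    mass: zone -> mass (e.g. areal extent)
    phi:  zone -> location significance index
    dist: (i, j) -> directed distance, so dist[(i, j)] may differ from dist[(j, i)]
    """
    return phi[i] * phi[j] * mass[i] * mass[j] / dist[(i, j)] ** exponent

def rank_pairs(mass, phi, dist):
    """Ordered pairs sorted from the strongest to the weakest interaction."""
    scores = {(i, j): interaction(mass, phi, dist, i, j) for (i, j) in dist}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical two-zone example with asymmetric distances.
mass = {"X1": 4.0, "X2": 2.0}
phi = {"X1": 0.5, "X2": 1.0}
dist = {("X1", "X2"): 2.0, ("X2", "X1"): 4.0}

print(interaction(mass, phi, dist, "X1", "X2"))  # 1.0
print(interaction(mass, phi, dist, "X2", "X1"))  # 0.25
print(rank_pairs(mass, phi, dist))  # [('X1', 'X2'), ('X2', 'X1')]
```

Because only the distance term depends on the order of the pair, the asymmetry of the distances is what makes the interaction direction-dependent in this sketch, while the significance indexes scale down pairs located at strategically insignificant positions.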

**Fig. 35.34** India map with each state designated with a rank with respect to four different parameters. **a** $\varphi_{X_i}$, **b** $\max_i \sum_j F_{X_{ij}}$, **c** $\max_j \sum_i F_{X_{ji}}$, and **d** $\max\left(\max_i \sum_j F_{X_{ij}},\ \max_j \sum_i F_{X_{ji}}\right)$

**Fig. 35.35** The five pairs exhibiting the highest levels of interaction: **a** $X_{20,5}$, **b** $X_{14,26}$, **c** $X_{26,27}$, **d** $X_{14,5}$, and **e** $X_{1,20}$. The five pairs exhibiting the lowest levels of interaction: **f** $X_{6,25}$, **g** $X_{25,6}$, **h** $X_{6,19}$, **i** $X_{6,23}$, and **j** $X_{23,6}$

### **35.6 Conclusions**

From our attempts since the early 1990s, we could clearly see great potential for mathematical morphological transformations in the three aspects (retrieval, analysis and reasoning, and modeling) of relevance to geosciences and GISci. This chapter provided a brief illustrative review of how mathematical morphology can be applied to varied topics of relevance to the mathematical geosciences and geographical information science communities. Readers are encouraged to consult the cited references for more details. Our studies show that there exist several open problems of relevance to the mathematical geosciences community that could be handled well by mathematical morphology. Some recent advances in mathematical morphology and their applications in spatial data segmentation and morphological clustering were discussed. Applications of both classical and modern mathematical morphological transformations in geosciences and GISci are yet to be explored to their full extent. It is our hope that the most visible and highly distinguished scientists active in the IAMG will spread the word widely and spur the interest of young researchers to take these strides forward.

**Acknowledgements** I would like to gratefully acknowledge Jean Serra and collaborators, mentors, reviewers, examiners, friends, employers, well-wishers, and doctoral students—S.V.L.N. Rao, B.S.P. Rao, M. Venu, K.S.R. Murthy, Gandhi, Srinivas, Radhakrishnan, Lea Tien Tay, Chockalingam, Lim Sin Liang, Teo Lay Lian, Dinesh, Jean Serra, Gabor Korvin, Arthur Cracknell, Deekshatulu, Philippos Pomonis, Peter Atkinson, Hien-Teik Chuah, Laurent Najman, Jean Cousty, Christian Lantuejoul, Alan Tan, VC Koo, Rajesh, Ashok, Pratap, Rajashekhara, Sravan, Aditya, Sampriti, and several others. I am grateful to Frits Agterberg for his phenomenal support and suggestions that made this review chapter readable and understandable.

### **References**


Sagar BSD, Rao BSP (1995d) Ranking of Lakes: Logistic maps. Int J Remote Sens 16(2):368–371



### **Part V Reminiscences**

### **Chapter 36 IAMG: Recollections from the Early Years**

**John Cubitt and Stephen Henley**

John Cubitt and Stephen Henley, with contributions from T. Victor (Vic) Loudon, EHT (Tim) Whitten, John Gower, Daniel (Dan) Merriam, Thomas (Tom) Jones, and Hannes Thiergärtner

This chapter records some of the dramatic history of the first few years of the International Association for Mathematical Geology (IAMG, much later renamed the International Association for Mathematical Geosciences) and its subsequent development, told mostly through recollections (both professional and personal) of some of its early members. It complements the paper in this volume by Václav Němec, who discusses his own experiences leading up to and following the foundation of the Society.

The IAMG was formed on 22nd August 1968, in a meeting at the International Geological Congress in Prague, Czechoslovakia, attended by 20 scientists from around the world. This followed preparatory work by an ad hoc committee of 14 (not all of whom were able to attend the formation meeting) which formulated statutes and by-laws and proposed names of a first set of officers.

J. Cubitt (✉)

Newhaven, Church Street, Holt, Wrexham LL13 9JP, UK e-mail: johnmcubitt@gmail.com

S. Henley

Resources Computing International Limited, 185 Starkholme Road, Matlock, Derbyshire DE4 5JA, UK

### **36.1 The Birth of Mathematical Geology and the Origins of the IAMG**

### *36.1.1 Vic Loudon*

The comprehensive framework for sharing geological knowledge developed over a long period, in the form of a shared network of scientific books and papers, maps, records, samples, specimens, reports, and guides—including the systematic output of regional and national geological surveys. Geological projects could contribute new information within a framework of existing knowledge and the requirements of publication. This framework, however, did not anticipate the arrival of the computer. In the early 1960s some enthusiasts considered that computers could have an important role in creating new, widely shared mechanisms for analysing, exchanging and integrating numerical information. But to many geologists at that time, computers were a passing fad—surely the complexity of geological observation and thinking could not be reduced to mathematics, never mind its mechanical representation! Nevertheless, computer programs were shown to handle recurring statistical tasks, even if only in the detail of a geological study. They might also build on the work of others. But that requires communication, a shared objective, and in due course a shared framework.

At that early experimental stage, computer applications in geology were generally rather trivial, overlapping, uncoordinated and unpublishable. They were nevertheless essential to determine which possibilities might be fruitful, and which would be duplication. To help programmers to gain a broader view of similar work elsewhere, an informal 'Geologically Oriented Scheme for Sharing Information on Programming (GOSSIP)' was maintained at Reading University in England. Notes from various workers in geological computing were assembled and typed onto punched cards. These were sorted and revised by hand, the results printed on a typewriter connected to the keypunch, and mailed to the participants. The last of several editions was circulated in 1966 (GOSSIP 1966). It provides an insight into a fast-growing area where many individuals had been exploring possibilities independently, and beginning to develop an initial overview. Apart from one mention of information retrieval, the applications referred exclusively to numerical data. Later, to quote Krumbein (1969): "…on the one hand we observe a growth in the complexity of programs, and on the other hand a spreading of essentially the same computer techniques through the many subfields of geology… the underlying methodology is so similar in all fields…that most speakers shifted emphasis from standard or conventional techniques to consideration of new and more analytical ways of setting up models applicable to their own fields."

### *36.1.2 John Cubitt and Stephen Henley*

Merriam (1981) put Loudon's comments into a historical perspective by giving a helpful summary of the development of mathematical geology. This shows that the introduction of mathematical methods into the science of geology was very slow, until the advent of computer technology, despite the efforts of such notable scientists as Paolo Frisi, Charles Lyell, Paul Deshayes, Charles Babbage, and Lord Kelvin, as well as statisticians such as Karl Pearson and A.N. Kolmogorov, and others such as R. Everest, chief surveyor for India. It is well-known that the first edition of Lyell's *Principles of Geology* (1830) included statistical data that he used to justify his subdivision of the Tertiary; however, once the classification was accepted, this statistical scaffolding was not deemed important enough to be retained in subsequent editions.

The earliest consistent efforts towards routine application of quantitative methods in geology were made by A.B. Vistelius from 1941 onwards, while the use of computers was pioneered by W.C. Krumbein, starting with a book in 1958 jointly written with L.L. Sloss (Krumbein and Sloss 1958). For the next ten years, there was a steadily increasing number and variety of publications on computational methods applied to geology, mostly but not exclusively statistical.

### *36.1.3 Tim Whitten*

Whitten noted that prior to 1968, different approaches to quantitative geology applied around the world. However, at the IAMG formation meeting, dissimilar approaches came together, having evolved principally in the Soviet Union, Western Europe, and U.S.A. Vistelius championed the concept that Mathematical Geology is a separate branch of science based on testing geological hypotheses mathematically, and that this should be IAMG's primary focus (Whitten 2003, 2004, pp. 384–5); for some years, he had contended it is not particularly important merely to manipulate geological data statistically. Dech and Henley (2003, p. 368) noted Vistelius (1991) considered that, if a science does not use mathematical modelling in constructing conclusions, "… it can be considered as belonging to the pre-Newtonian period, … behind the present-day level of research by approximately 300 years."

### **36.2 The Role of the Kansas Geological Survey in the Origins of the IAMG**

### *36.2.1 Tom Jones*

When I got to Northwestern University in 1967, I found several faculty members were quantitative, along with a few students. Krumbein was doing work in several areas at that time, notably geographic forms, Markov chains, and modifications of trend analyses. The Kansas Geological Survey Computer Contributions (KGSCC), spearheaded by Dan Merriam, provided an ongoing source of publications on mathematical geology and associated software.

### *36.2.2 Vic Loudon*

In the late 1960s, Dan Merriam led a pioneering group of geological programmers in the Kansas Geological Survey at the University of Kansas, describing the results of their computing activities in its own publication, the KGSCC. In 1967–8, Richard Reyment spent some time at the Kansas Geological Survey, on sabbatical leave from the University of Uppsala in Sweden. He was another of the prime movers in establishing the IAMG (its first General Secretary and subsequently its President, and in 2002 the recipient of that organization's Commendation). I was privileged to listen to one of their conversations, where they agreed that a formal body was desirable to assist and encourage documentation and communication of these developments.

### *36.2.3 Tim Whitten*

The momentum driving a founding meeting in 1968 really stemmed from the Kansas Survey folk—the main activist there was Dan Merriam, who was very keen on instituting an international society and I imagine it was he who got the meeting included in the IGC programme.

### **36.3 Name and Establishment of the Society**

### *36.3.1 Vic Loudon*

Merriam (perhaps only in the wishful thinking of my biased mind), in the conversation referred to above, seemed to take the view that computer science, rather than mathematics, was the key issue. However, it seemed that the geological establishment at that time might find 'mathematics' more acceptable. Subsequently, Richard Reyment organised an ad hoc committee for the purpose of founding an association for the promotion of mathematical geology.

### *36.3.2 John Gower*

I remember there being discussions on what name to give to the new Society and that somebody had suggested Geometrics, echoing the names of the Biometrics and Psychometric Societies. It was noticed that Geometry had forestalled that suggestion, so it became Mathematical Geology, succeeded by Geomathematics, succeeded by Mathematical Geosciences; but perhaps Geometrics was not so bad an idea as it seemed, because originally geometry was about measuring the Earth. Indeed, the mathematical geologists had nomenclatural problems from the start when, because of the political climate at that time, they could not appoint D.G. Krige from South Africa, who would have been the obvious choice, to the Presidency. They made him a Councillor.

### *36.3.3 Dan Merriam*

The 1968 IAMG foundation meeting followed considerable correspondence and fact-finding by the ad hoc committee whose Members were:


This committee formulated a set of statutes and by-laws (largely written by R.A. Reyment in compliance with IUGS and ISI guidelines), made provision for establishing a journal, and proposed a slate of officers.

### **36.4 Foundation of IAMG Publications**

### *36.4.1 Tom Jones*

As time went on, the IAMG formed the journal Computers & Geosciences (C&G). The Kansas Geological Survey Computer Contributions (KGSCC) series was discontinued in 1970, probably in part due to C&G and as a result of Dan Merriam moving to Syracuse University to become Chairman of the Geology Department. The American Association of Petroleum Geologists (AAPG) formed a committee on Computer Applications, but I do not recall that it had much influence. A North American group formed MGUS (Mathematical Geologists of the United States) around the mid 70's with the goal that MGUS would eventually become a regional group tied to IAMG. Much later (I believe 1985) AAPG sponsored a computer-oriented magazine, GEOBYTE.

### *36.4.2 Vic Loudon*

To quote the IAMG website: 'The mission of the International Association for Mathematical Geosciences is to promote, worldwide, the advancement of mathematics, statistics and informatics in the geosciences'. It established a journal and a newsletter. From its inception in 1968, an important role of the IAMG has been publication—initially in its journal Mathematical Geology (now Mathematical Geosciences) which 'publishes original, high-quality, interdisciplinary papers focusing on quantitative methods and studies of the Earth, its natural resources and the environment.'

In 1975, Computers & Geosciences was established as a journal devoted to all aspects of computing in the geosciences. It was published by Elsevier with Merriam as its first Editor-in-Chief, and in due course became another IAMG publication. It publishes research papers on computer methods in the geosciences, such as spatial analysis, geomathematics, modelling, simulation, statistical and artificial intelligence methods, e-geoscience, geoinformatics, geomatics, geocomputation, image analysis, remote sensing and geographical information science.

These journals (including the later IAMG publication Natural Resources Research) filled a growing gap in the maturing area of computer applications, and became an essential part of geological computing. The earlier ad hoc sharing of results and many individually trivial, and therefore unpublishable, exploratory studies had helped to create the basis for their development and their integration. This is relevant now, as communication heads towards another looming gap, described later.

### **36.5 Prague**

### *36.5.1 Dan Merriam*

The organizational meeting of the IAMG took place at the XXIII International Geological Congress (IGC) in Prague's New Technical University, Czechoslovakia, on the 22nd of August 1968. It was attended by 20 representatives from 10 different countries:


### *36.5.2 Tom Jones*

Several Northwestern University faculty members went to the Prague IUGS meeting, but Krumbein and Whitten were the only ones who were associated with the founding of IAMG. Of course, when word came of the Soviet army moving into Prague during the IUGS, everyone at Northwestern University was concerned about safety issues, but no news was available to us. All went well, and they had lots of stories to tell upon their return, along with photos of tanks driving down the street in front of their hotel.

### *36.5.3 Tim Whitten*

I was a founding IAMG Member in Prague in 1968 and, in several papers (Whitten 2003, 2004, 2005), I've alluded to that experience and to Vistelius' participation in the founding.

In many ways, 1968 was an extraordinary year that rocked the world (cf. Kurlansky 2004). Some enthusiasts gathered to create the IAMG in exciting, but tragic, times. Soviet troops had occupied the city on August 21st; guns of encircling Soviet tanks pointed at the University, which was the centre for printing and disseminating news. Vistelius was elected IAMG President and Krumbein 'Past President' (a designation he appreciated and found amusing!); both are fathers of geological modelling methodology.

Opening of the IGC itself was fine but it was immediately followed by the invasion. The founding meeting was therefore brief, hurried, and somewhat stressful because the Americans present were anxious to get away to complete and execute their evacuation plans (being organised by the US Embassy); they soon left Prague. However, despite the fact that I was an official delegate of Northwestern University, the organisers of the US evacuation wouldn't have anything to do with me, because I was on a UK passport. With most other delegates, I continued supporting and attending IGC sessions until, after a couple more days, the Czechs felt it necessary to terminate the Congress (at a very emotional, hastily arranged closing ceremony). My friends in the Finnish contingent immediately promised I could evacuate with their party but, in the end, I learned the British Embassy was organising two coaches to drive out to Nuremberg in Bavaria. The route went through Pilsen and the passengers, being British, made the Czech drivers (against their concerns and protests) stop at the Pilsen brewery to have a last tankard apiece; thence to Nuremberg and a special BA plane via Amsterdam to Heathrow.

I had been on an excellent 14-day field trip right through Czechoslovakia before the Congress, mainly organised and led by Václav Němec; there always seemed to be an orchestra at dinner, stridently playing Dr. Zhivago, much to the consternation of the several Russian delegates.

Dr. Václav Němec from Prague deserves a word. He played a large role in the Prague IGC. In addition to his Prague home, my wife and I visited his attractive rustic cottage (in the forest some way up to the north) once whilst the country was still Russian occupied. He contributed quite a lot to one part of mathematical geology by regularly organising well-attended conferences at Príbram (not too far SW of Prague) through the 1970s; as appropriate to a mining town, there was quite a focus on mining issues and latterly on geo-ethics. These conferences were loosely connected with IAMG. After truly awful food available during the conferences, he always organised a magnificent closing banquet (always pronounced 'basket'); I don't know where he rustled up the fine food and drink!

### *36.5.4 Dan Merriam*

Modified from an interview with the Lawrence Journal World, August 21, 2008, with permission of the Merriam family.

In August 1968, the Soviet Union's Warsaw Pact allies rolled into the Eastern European country with tanks and planes to squash the movement known as the "Prague Spring," which sought more political and social freedoms during the Cold War years. Dan Merriam, who sadly died in 2017 after retiring from the Kansas Geological Survey, escaped the country safely on a train to Austria. He recorded his notes in Prague and mailed them back to Lawrence. Merriam lived through a tense time when more than 100 people were killed and Czechoslovakia's Communist Party leader, Alexander Dubcek, was arrested. Dubcek didn't return to Prague until 1989. Just before the invasion, geologists from around the world, including the Soviet Union, were there in August attending a session for the IGC to form a new organization, the International Association for Mathematical Geology.

British colleagues had driven Merriam and Stanford University geologist John Harbaugh into Prague for the conference. They were at a hotel in the eastern part of the city when at 2 a.m. on Aug. 21, low-flying airplanes suddenly woke Merriam. "For some reason in my mind, I thought the Russians were coming, but it didn't occur to me that's what was happening," he said.

**Fig. 36.1** Dan Merriam and Trevor Ford (Leicester University, UK) searching for a way out of Prague August 1968. Copied with permission of the Merriam family

The invasion also shocked the native Czechs and even the Soviet delegates who attended the geology conference. On the eastern side of the city, Merriam didn't witness much destruction. His notes from those few days mention an eerie sense of calm in the eastern part of the city, apart from airplanes sweeping in and tanks rolling around. He noted "the tears in the eyes of the waitresses and the little knots of grim" in the neighbourhood along with several protests. Much of their news came from rumours on the street because radio stations had been bombed and the spread of information was spotty. "There wasn't anything they could do. There wasn't anything we could do, either, but just watch and hope nothing happened," Merriam said.

The US Embassy had advised Merriam and his colleagues to stay in the hotel because transport from the city was impossible. Even though several members fled the city, the geological conference continued to meet for one day after the invasion. "The new group, the International Association for Mathematical Geology, even elected its leadership, including President Andrei B. Vistelius, a geologist from the Soviet Union, while the tanks occupied the city," Merriam said. "It had nothing to do with it, but it was kind of an interesting coincidence anyway," he said.

During that week back in Lawrence, Annie Merriam was on edge. She frequently called Harbaugh's wife, Josephine, to see whether there was any word. But she heard nothing. Finally, Dan Merriam and John Harbaugh had a chance to leave Prague (Fig. 36.1) on a train. It left the city even with tanks nearby, he said. As it approached the Austrian border, the lights went out, and soldiers came to check passports. The train eventually stopped in Vienna, where Merriam sent the telegram to his wife. He also mailed home his letter, which didn't arrive in Lawrence until after he returned home the next week.

It was only a few words but the short telegram Annie Merriam received at her home on Aug. 24, 1968, gave her a huge sense of relief. "ARRIVED VIENNA OK = DAN = ." "When that came, we were thrilled," said Annie Merriam. When he did return to Lawrence, it ended a tense chapter for his family. "Don't you ever go anywhere again," Annie Merriam said about her thoughts upon her husband's return. But he did continue his travels. He even returned to Prague in 1993 for the IAMG's 25th anniversary.

### *36.5.5 Vic Loudon*

A few months after our marriage, my wife and I set out from Reading in southern England in our Morris Minor, heading for the inauguration of IAMG. Apart from a stone-strike on the windscreen and its replacement before we reached the English Channel, the journey seemed uneventful. But odd things happened. Travelling through the beautiful Czech countryside, we were forcibly stopped at a secluded spot by a long, shiny black Mercedes. The driver came menacingly to our window: "Exchange foreign currency now, very good price!" The distraction of a passing truck let us escape. As we approached Prague, we noticed more and more heaps of cobblestones that had been lifted from the road and neatly arranged—road-works so tidy they looked like walls. We had booked a room at the Zlata Husa, now a luxury hotel, but then more mundane. The friendly receptionist, carrying our room key, showed us into a small alcove in the reception area, pressed a button, and the entire alcove, still open to the world, moved gently upwards through the ceiling, becoming an alcove (with us still in it) in the room above. She showed us to our bedroom overlooking the beautiful Wenceslas Square. But why did our door look as though it was cased in sheets of steel?

No matter. The hotel was in the centre of town, convenient for exploring the neighbourhood, which we eagerly proceeded to do. It was a long time ago, and I forget the precise order of events, but well remember enjoying walks through alleys and shops of the Old Town; the impromptu puppet show for our benefit in the back room of a tiny shop; and the crossing of the Vltava River by the ornate Charles Bridge, where the youth of the city were chatting in cheerful groups. On the other side was St Nicholas Church, with Prague Castle beyond.

In our bedroom at the Zlata Husa, about 4 a.m. on the 21st of August, we were wakened by planes flying at rooftop level. Did this happen often? But then it was followed by gunfire outside our window, and explosions nearby. Before dawn broke, the sound of tanks moving into position came from below. The armed forces of the Soviet Union and the Warsaw Pact countries had made their point, and the city was now under their control. A Google search for 'images for Prague spring photography' gives a good impression of the results.

The *Report of the 23rd Session of the International Geological Congress* records on page 20 that "On August 21st, 1968 and in the following days, the work of the Congress was interfered by the entry of foreign armies into Czechoslovakia. In result of the overall uncertainty, the blockage of bridges, tanks around the Congress Headquarters, shooting in the streets and other disturbances, a considerable part of the attending members was prevented to come to the Congress Headquarters or had to leave prematurely." Visiting geologists housed in the suburbs lacked any means of reaching the meeting. Merriam records that the US embassy negotiated a train to the border, by which they were evacuated (Lawrence Journal World 2008). The IGC Report records on pages 200–201: "International Association for Mathematical Geology (IAMG). This Association was officially founded at Prague on August 22nd, when it held its General Assembly. The following officers were elected: President: A.B. Vistelius (USSR). Vice President: W.C. Krumbein (USA), G.S. Watson (USA), General Secretary: R.A. Reyment (Sweden), Treasurers: V. Němec (CSSR), T.V. Loudon (UK), Ordinary members: E.H.T. Whitten (USA), D. A. Rodionov (USSR), D.G. Krige (S. Africa), G. Matheron (France), F.P. Agterberg (Canada), S.N. Sengupta (India), Editor-in-Chief: D.F. Merriam (USA). The application for affiliation to the IUGS (International Union of Geological Sciences) of this Association was unanimously approved by the Council." And so, the IAMG was created, before a reduced but still quite substantial audience.

While I was attending the meetings, my wife took the opportunity to photograph the interesting happenings in the Old Town. A group of soldiers objected, and indicated that she should hand over her camera. They opened it to spoil the film, and returned it. A round of applause came from the on-lookers, perhaps realising that the film in the Instamatic camera would be unaffected.

A day or two later, when our business and sight-seeing were eventually complete, we felt that we should head for home and our anxious relatives. Getting out of Prague was no problem, returning on the same route as our arrival. But half-way to the border, a bridge across a river was blocked by the military, and the road closed to all. Despondently, we slowly retreated for about a mile, when an old man outside his cottage waved us down. We had no language in common, but looking about nervously he gesticulated towards a farm road a hundred yards away, making rippling movements with his hands, and repeating what sounded like the German word 'wasser'.

Not understanding, but with little to lose, we followed his directions and came to a wide stretch of water. It was the same river, and this might be a ford. While preparing to wade in and find out, a truck came the other way, water just reaching its axles. The ford and the road led us back to our intended route, now beyond the blocked bridge. So that kind man, at considerable risk to himself, had made possible our continued journey. Our fuel was running low, and all garages had been closed. Downhill coasting and gentle use of the accelerator brought us eventually to the border at Rosvadov, with the needle firmly set on Empty. A careful passport inspection, and we were through, greeted by US soldiers—pleasant, friendly and helpful. "There's a gas station just up there. Or ask that guy [pointing to their encampment], he'll fill you up, no charge." We took the first option, and soon were in open countryside. We stopped, got out the car and for a while just stood there together—still, silent, and subdued.

### *36.5.6 Hannes Thiergärtner*

Founding member of the IAMG

I remember rather well the founding procedure of the IAMG. This event occurred for me as a drama in three unusual acts.

The prelude was to empower me to participate at the congress at all. Let me explain for our younger colleagues that the European world at that time was split into the western and the eastern blocs, characterized by extremely different political-social systems and ways of life. I grew up, lived and worked in the former German Democratic Republic (GDR) that belonged to the "eastern world". Here, many, not to say most, things were centralized and provided from the top. Thus, it was nearly impossible to participate individually in an international congress. Participants were selected, nominated and grouped into so-called delegations. I worked in the Central Geological Institute in Berlin, the geological survey of the country, as a young graduate in the field of mathematical geology and electronic data processing, without international reputation. I never had a chance to be nominated for the IGC. On the other hand, I saw an opportunity to go there because it was the first IGC after World War II held in the Eastern Bloc and restrictions on visiting the congress were still distinctly lower than in later years. So, I successfully asked the Director of our institute for leave and paid the fee and all other expenses out of pocket. It was a uniquely courageous decision, for both the Director and myself. I travelled to the congress, was integrated into the official "delegation" and found accommodation in a student hostel.

The main act played out in Prague. It was and is a wonderful and pulsating place. The townscape in the late 1960s was still characterized by the post-war years, predominantly in greyish colours but nevertheless imposing and unique. A metro network did not yet exist at that time but the town centre was well served by a dense tramway system. The organizers of the XXIII IGC had chosen for the opening ceremony the auditorium of the Charles University, the Carolinum, in the Prague historic centre, an amazing and venerable baroque hall with a superb interior. The ceremony was impressive, indeed, and all participants hoped for a fruitful scientific exchange of ideas within the following days.

All attendees knew about the critical political situation because of the Czechoslovakian trends to reform their political system. My journey to join the congress session "Mathematical geology" passed the Ministry of Defence during these days. When I set out for the lectures on Wednesday and Thursday (August 21–22), I had to walk in front of the Ministry between many tanks which had come from the Warsaw Pact states and occupied the town. It was shocking! I do like to take photographs but I did not in this moment; it was too serious. The situation was ghastly and the agile Prague was silent.

I reached the session rooms without personal impairment. There I met so many colleagues I never had seen before but knew from the scientific literature, such as Frederik Agterberg, John W. Harbaugh, Vyacheslav Kutolin, Victor T. Loudon, Richard B. McCammon, Václav Němec, Richard Reyment, Dmitri Alekseyevich Rodionov, Andrey Borisovich Vistelius or Eric Harold Timothy Whitten. Altogether 20 persons were present. It was simply great for the young fresh geologist from Germany! Regardless of the stressful situation, we founded the International Association for Mathematical Geology. The organisation was well prepared by Richard Reyment and it proceeded to elect its leading officers. I remember that the participants from the Eastern Bloc during a break agreed to vote for Andrei B. Vistelius as first president to ensure parity within the top of the association.

On the same day, all members of the GDR delegation got orders to meet at a very small railway station in the western part of Prague to "enter" one of the now rarely running trains to the German boundary. We left the hosting country in a night and fog action.

### **36.6 Subsequent Events Following Prague**

### *36.6.1 John Cubitt*

As a second-year undergraduate at Leicester University at the time, I was almost unaware of the events of the IAMG foundation. All I can recollect is my tutor, Trevor Ford, and our Department Chairman, Professor Peter Sylvester-Bradley, returning from Prague with tall tales of the various lucky escapes. It must, however, have made some form of subconscious impression on my mind because less than a year later I mentioned to Trevor that I would like to go on to undertake postgraduate work in computer applications in geology. "In that case," he said, "you need to meet someone," and marched me out of his office and down the corridors of the Department of Geology. In a minute, we found the mystery person he wanted to introduce to me. Striding down the corridor in cowboy boots, string tie and cowboy hat, in his typical dynamic, intimidating style, was Dan Merriam. After brief introductions from Trevor, Dan talked about the Research Group at the Kansas Geological Survey and how I should undertake a Ph.D. at Leicester University but with the first year paid for and spent at the KGS. "That will be OK with the Department, won't it?" Dan said to Trevor and, whether it was or not, the decision had been taken. Within a few months of whirlwind arrangements, I was on my way to Kansas and my career was underway (Dan subsequently took me to Syracuse University as well, so I have much to be grateful to this amazing dynamic organiser for). This frenetic activity was typical of the rapid growth in the subject of mathematical geology and the IAMG at the time.

### *36.6.2 Hannes Thiergärtner*

The after-play led me back to the reality of those times. The founding of a new seminal association within the international geological community was ignored [in the Eastern Bloc], especially in the governmentally organized surveys. A policy of restriction was introduced step by step. Any contact outside of the Eastern Bloc, to say nothing of an IAMG membership, proved to be impossible and was strictly forbidden. I would however meet the majority of founding members of the IAMG again in 1984 during the XXVII IGC in Moscow, where I could take an active part only on special request made by D.A. Rodionov at the GDR ministry of geology. But that is another story.

With the exception of my colleagues in Prague, Leningrad and Moscow, I was unable to renew my contacts with other founding members until after German reunification (1990). Frits Agterberg was the first colleague I met, in Potsdam (Germany). That was also in 1990, and I could then renew my membership in the International Association for Mathematical Geosciences. I think we all have utilized this late time as well as possible to solve some common questions in our interesting field of science.

### *36.6.3 Stephen Henley*

As a humble Nottingham University postgraduate student in 1968, I wasn't at the IGC or the Prague launch of IAMG. However, I was deeply involved in computer applications and statistical analysis, processing what then seemed like huge volumes of data from the X-ray fluorescence spectrometer, and then making sense of the data using esoteric methods such as factor analysis, cluster analysis, and trend surface analysis. Under the mentoring eye of Peter Harvey, I joined IAMG as soon as I heard of its existence, in 1969—and have remained a member without a break since then. It is fair to say that mathematical geology shaped my entire career. As my Ph.D. studies came to an end in 1970, an opportunity arose in Australia.

The Bureau of Mineral Resources (now Geoscience Australia) suffered a mass resignation of several dozen geologists who left to join one of the periodic mining booms—this one in Western Australia, sparked by discoveries of major nickel deposits. Among those who left was their one computing 'expert', so my meagre computing experience was sufficient to gain me a position in Canberra, where I gained a broad experience of mathematical modelling and statistics in fields that included hydrogeology, exploration geochemistry, earth tides, and global scale geochemical modelling of Archaean evolution of the Earth (this last with Andrew Glikson, based on studies of some of the world's oldest rocks). After my return to the UK, I finally accepted my type-casting as a computer geologist and in 1973 joined the Computer Unit of the Institute of Geological Sciences (now the British Geological Survey). This small specialist unit occupied two rooms on the top floor of the Geological Museum in London, and had an IBM 1130 computer—which even then was of very limited capacity. However, we also had access to the much more powerful mainframe IBM 360/195 at the Atlas Computer Laboratory (ACL) in Oxfordshire.

The head of the Computer Unit was Dr T. Victor (Vic) Loudon who had pioneered generalised software development in his previous academic work at Reading University (the Rokdoc package) and was one of the founding members of IAMG. Rokdoc was the inspiration for a colleague Keith Jeffery to start the development of a general-purpose geological data handling system 'G-EXEC' which was built around the recently published ideas of IBM researcher Edgar Codd for relational database management. When I first met them, Keith and his co-worker Elizabeth Gill at ACL, were preparing an early version of G-EXEC: I walked into the office they were using to see the floor strewn with many piles of punched cards and reams of fan-folded lineprinter listings of the software. The whiteboard displayed a beautifully simple diagram of the system structure, and I was hooked.


**Table 36.1** Officers and Council of IAMG

<sup>a</sup> Served as Vice President

**Fig. 36.2** Official logo of IAMG

providing computing services to a wide range of users within IGS as well as supplying the software to other institutes in the Natural Environment Research Council and worldwide. The IGS Computer Unit itself was a research centre in its own right: John and I both worked together on the potential use of catastrophe theory as a geoscience modelling tool, though we were ahead of the times, and it was only when catastrophe theory was superseded by chaos theory that the potential became reality, in such fields as climatology and oceanography. Working with Jeff O'Leary, then at Leicester University, I also used the relatively new field of geostatistics in developing a 3D model of the Jwaneng diamond pipe in Botswana, but misgivings about the method, arising from that and other projects, led to development of more robust 'nonparametric' methods which formed the basis of a book (and led to my receiving the 1982 President's Award of IAMG).

The underlying G-EXEC concepts (and much of the software itself) were subsequently incorporated into other products including, in my case, the 'Datamine' mining software system. The rest, as they say, is history.

### *36.6.4 Dan Merriam*

(From Merriam 1978, copied by permission of the Merriam family)–

A list of Officers and Council members of the Association is given in Table 36.1. During the first year a call for members was made. A logo was designed according to specifications of D.F. Merriam by Charles Barksdale of the Kansas Geological Survey for use in connection with official Association business (Fig. 36.2). This logo was used on a certificate received by all charter members (those who joined during the first year). Negotiations were completed with Plenum Press for a new journal, Journal of Mathematical Geology (JMG), which first appeared in 1969. It was made a quarterly in 1970 and a bimonthly in 1975. Also in 1975 the quarterly journal Computers & Geosciences (C&G) was established, with Pergamon Press as the publisher.

The JMG focusses on geomathematics and mathematical geology, ranging from geological arguments supported by numerical observations to purely mathematical models implemented with geological data. C&G is devoted to the rapid publication of computer programs of interest to earth scientists, written in widely used languages, and their applications. A quarterly Newsletter contains general information of interest to members.

Each year the Association sponsors meetings, many in cooperation with other organisations. For example, IAMG cohosts the Geochautauqua held each year at Syracuse University and every other year a session in mathematical geology at the Pribram Mining Congress. At each IGC since Prague, we have sponsored or cosponsored several sessions of interest to our members. In addition, we have cohosted sessions at meetings of the American Association of Petroleum Geologists, and the Geological Information Society of the Geological Society of London. Proceedings for many of these meetings have been published either as special issues of the Journals or as hard-back books.

Seven national groups have been created and are functional. They are in the United States, Canada, Brazil, Great Britain, Czechoslovakia, Hungary, and Russia; others are in the formation stages. These national groups are active in disseminating information on geomathematics on a national level. Although national groups are autonomous, they are expected to coordinate their activities with the Association.

Operation of the Association is mainly through committees. The Project Committee is responsible for preparing the meetings at the next IGC, which is held every four years. The Membership Committee is concerned with soliciting new members; the Finance Committee with soliciting money; and the Educational Committee with organizing material and activities to promote geomathematics. Each year a committee (chaired by the President) selects the William Christian Krumbein medallist and another special committee selects the Best Paper for an award.

**Table 36.2** IAMG committee chairmen

**Fig. 36.3** Design for Krumbein medal

A special committee has undertaken the task of compiling a list of all computer-aided instruction (CAI) programs available and of interest to geologists, and it will be distributed in the near future. There are also plans for compiling a list of computer software; the list will contain information on programs and their availability and limitations. Chairmen of the various committees are given in Table 36.2.

The Association maintains close contact with other organizations which share similar interests. For example, several members of the Association serve on the IUGS-sponsored COGEODATA Committee. Others are working on special projects for CODATA. The Association has a member on Scientific Committee 4 which evaluates quantitative aspects of projects for the IGCP. Liaison is maintained with the International Paleontological Association.

The William Christian Krumbein Medal is presented each year by the Association to an outstanding geomathematician. The first recipient was Professor John C. Griffiths of Pennsylvania State University, the second, Professor Walther Schwarzacher of Queen's University, Belfast, Northern Ireland, and the third, Dr. Frederik P. Agterberg of the Geological Survey of Canada, Ottawa. The recipient receives a medal with the likeness of William C. Krumbein on one side and the Association's logo on the other. The Medal was designed in 1977 by A. Pattison, sculptor of Florence, Italy and Winnetka, Illinois (Fig. 36.3).

The IAMG, in its short period of existence, has participated in and contributed to changes in the earth sciences. In the future the Association should play an even larger role in development of the science.

### **36.7 The Looming Gap**

### *36.7.1 Vic Loudon*

The methodology of geological investigation and communication was initially formalised within the constraints imposed by the traditional mechanisms of pen, paper, typewriter, printing press, bookshops and libraries. It has been extended by computer techniques, formalised in a framework set by the manufacturers and providers of computer equipment and software, but is still based on and restricted by geological traditions, conventions and precedents. Geological surveys continue to provide geological maps world-wide, with defined scales of presentation, uniform stratigraphical classifications, and separate volumes of text, with cross-references to locations on the map.

These products provide a stable underlying shared basis for subsequent geological investigations, essential for accurate communication, including a consistent and coherent structure within which new investigations can build. This is achieved by results being confined within the rigid framework and slow-moving processes of conventional publication. Geological knowledge can potentially build on a wider framework, going far beyond its current traditions, conventions, limitations and precedents.

The global information structure is being remodelled, based on new technology with unfamiliar implications. Current developments in computer translation, voice recognition and speech synthesis point to a much more flexible future.

As in the mid-1960s, a significant gap may be developing between the future of geological communication and its current implementation of published papers and maps. Experimental initiatives might be a good starting point. Their results might be inappropriate for traditional patterns of communication, but information on their development could usefully be exchanged in an open and flexible forum, for which IAMG might be a suitable host.

#### **Appendix**

A readable account in the Economist (2017) describes the power of deep learning: 'an artificial intelligence technique in which a software system is trained using millions of examples, usually culled from the internet… Computers are, in short, getting much better at handling natural language in all its forms.' But (p. 11): 'Scientists do not know how the human brain draws on so many different kinds of knowledge at the same time. Programming a machine to replicate that feat is very much a work in progress.'

The conventional forms of scientific papers and the fixed scales of geological maps reflect the limitations and conventions of earlier technologies. Future development of our understanding of global geology can only be achieved through a multitude of investigations and experimental studies. Many geological developments will be based on local knowledge and requirements. Many will be too trivial for conventional publication but valuable in their own local context. Already, the computer technology for sharing detailed studies and strategies is well established. It could help to provide the essential background for a more comprehensive framework. It could lead to deeper evaluation and integration of data, text, graphic and cartographic information at all relevant levels of detail; rapid and appropriate response to input of new information; the routine calculation, depiction and quantitative assessment of multiple geological hypotheses; and the emergence of a never-ending dialogue between human input and computer implementation, supported by a multi-media interface for input and output.

This calls for developments that go far beyond the precedents and traditions of our established conventions, into an environment for geological information where users are motivated to carry forward an accessible shared understanding. Maps, data, illustrations, simulations, text explanations and scientific papers need not be separate entities nor restricted to a single scale. Input of new information can be rapid, with continual assessment and reassessment of its validity and relevance, and examination of its consistency with previous work.

### **References**



### **Chapter 37 Forward and Inverse Models Over 70 Years**

**E. H. Timothy Whitten**

**Abstract** The transition over 70 years from qualitative rock description to attempted quantitative description of rocks and rock bodies (inverse modelling) and testing of process models with observation data (forward models) are outlined. Dramatic increases of readily measured variables, combined with almost unlimited computing power, yielded a plethora of varied inverse models, but limited attention has been given to critical sampling, variance, closure, 'black swan', and nonlinear issues; recent approaches to closure problems hold promise. Especially for plutonic rocks, paucity of quantitative process modelling left exciting forward-modelling opportunities neglected. Resulting challenges ahead are anticipated.

**Keywords** Sampling ⋅ Variance ⋅ Composition variability ⋅ Black swans ⋅ Granite composition

### **37.1 Birth of IAMG in 1968**

In many different ways, 1968 was an extraordinary year that rocked the world (cf., Kurlansky 2004). Some 20 enthusiasts gathered at the XXIII International Geological Congress in Prague's New Technical University, Czechoslovakia, to create the International Association for Mathematical Geology in exciting, but tragic, times. Soviet troops had occupied the city a couple of days previously; guns of encircling Soviet tanks pointed at the university, which was the centre for printing and disseminating news. Vistelius was elected first IAMG President and Krumbein 'Past President' (a designation he appreciated and found amusing!); both are fathers of geological models.

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_37

E. H. T. Whitten (✉)

Riverside, Widecombe-in-the-Moor, Devon TQ13 7TF, UK e-mail: timwhitten@btinternet.com

At that meeting, dissimilar approaches came together, having evolved principally in the Soviet Union, Western Europe, and U.S.A. Vistelius championed the concept that Mathematical Geology is a separate branch of science based on testing geological hypotheses mathematically, and that this should be IAMG's primary focus (Whitten 2003; 2004, p. 384–5); for some years, he had contended it is not particularly important merely to manipulate geological data statistically. Dech and Henley (2003, p. 368) noted Vistelius (1991) considered that, if a science does not use mathematical modelling in constructing conclusions, "… it can be considered as belonging to the pre-Newtonian period, …. behind the present-day level of research by approximately 300 years."

### **37.2 In the Beginning (One Pre-1968 Experience)**

When I was specializing in petrology in 1948, Hatch and Wells (1937) was my 'bible'. That descriptive, natural-history type of foundation meant it was thrilling in 1950 to visit Jacupiranga, the Brazilian jacupirangite type locality. For a Ph.D. project in 1948, it was recommended I look at 260 km<sup>2</sup> of coastal NW Ireland to see what is there; seventy years later, this seems an unlikely method of identifying a thesis project. The area is red (granite) on the Geological Survey of Ireland 1:63,360 map (Hull et al. 1889).

A plan to record variability of granite across the area (including numerous islands in the Atlantic Ocean) was needed. Immediate problems in 1949 were devising (i) a scheme to collect representative samples, and (ii) realistic measurements (measurable in the field or laboratory) to reflect variability.

Unscientifically, a one-mile grid was oriented to maximize (by eye) grid nodes over outcrops (i.e., islands in the ocean and less peat-bog and drift-covered mainland areas). It was planned to collect samples (with hammer and chisel) at all nodes if possible. In the field, two compromises became necessary—using the nearest outcrop to nodes and accepting any hand-sample that could be hammered off.

Wet chemical analysis of numerous samples was beyond available resources; X-ray fluorescence analysis was then undeveloped. Point counting thin sections to determine mineral volume percentages with a Dollar (1937) mechanical stage was feasible, provided larger thin sections (3.3 × 2.3 cm) could be hand ground and stained with sodium cobaltinitrite (both challenging in 1949); this staining technique was described by Chayes (1952). Using a Chayes (1949) electrically controlled stage improved point-counting accuracy. Studies of spacing and required number of counts (Chayes and Fairbairn 1951; Chayes 1954) suggested sufficiently large thin sections were being used. Manual contours for modal variables (e.g., K-feldspar volume percentage, colour index) at 44 grid nodes reflected considerable areal variation (Whitten 1957). Such contours were very controversial because they crossed ocean between islands and superficial deposits on land; also, no exposures occur in numerous grid squares. A senior reviewer deemed it impossible to draw contours across ocean (despite greater outcrop density with off-shore islands than on land with peat bogs, farming, etc.).

In 1958, I became a colleague at Northwestern University of W. C. Krumbein, who was pioneering quantitative description of sedimentary rocks. The University acquired an IBM360 mainframe computer; we used punch cards and wrote FORTRAN programs for statistical descriptors and surface-fitting algorithms for areally-distributed data (e.g., Whitten 1960, 1961). Analogous approaches began thriving at Kansas Geological Survey, Pennsylvania State University, etc. Krumbein developed the concept of descriptive, conceptual, and predictive models (Krumbein 1963; Krumbein and Graybill 1965, p. 13, *et seq*.; Whitten 1964). Driving to Leningrad to spend time at Vistelius' Institute for Mathematical Geology was a privilege in 1971.

### **37.3 Inverse and Forward Geology Problems**

Vistelius (e.g., Vistelius 1977) differentiated *inverse* from *forward* problems. The objective with the former was describing the nature and variability of specified rocks, etc.; that is, with statistical or other techniques, formulating descriptive and/or genetic models for essentially arbitrary data for arbitrary variables. With forward problems, the objective was testing validity of genetic models (based on currently available information) for rocks, fold belts, etc. That is, testing whether a genetic model is supported or rejected by data for variables dictated by that model; many commonly measured variables are likely to be irrelevant for such testing (cf., Whitten 2005).

For sedimentary and metamorphic rocks, inverse and forward problems present fewer difficulties. Thus, 'marine beach' can be defined descriptively by physical, chemical, and biological features that commonly enable marine-beach deposits to be recognised (e.g., in the stratigraphic column), or genetically by environmental conditions that result in beach formation (waves, currents, sediment transport, etc.). Similarly, as Bayly (1968) pointed out, metamorphic facies can be defined by presumed temperature and pressure during genesis (Eskola 1915, p. 114; Turner and Verhoogen 1951) or descriptively by diagnostic mineral assemblages (Fyfe et al. 1958). With igneous rocks (especially plutonic assemblages), geotectonics, etc., inter-relationships between the descriptive and genetic are commonly very debateable (Whitten et al. 1987a, p. 334).

### **37.4 Forward Models in Earth Sciences**

Forward modelling is in its infancy and rare because, in most cases, little objective quantitative information is available about genetic factors, especially for plutonic rocks. Unlike many scientific fields, most earth-science domains do not permit *reproducible* experiment and testing. Vistelius (1972) used Tuttle and Bowen's (1958) experimental petrology to illustrate forward modelling of 'ideal granite', extending his method<sup>1</sup> to Omsukchan Granite, SE Asia (Vistelius and Romanova 1972), Malsburg Granite, Germany (Choubert and Vistelius 1972), etc.

Over the past decade, numerous "forward models" appeared in geophysical studies (petroleum, mining, water, volcanic activity) for prediction and extrapolation based on measured variables (e.g., Geol Soc Amer Symposium 2002; Sui et al. 2012; Butler and Zhang 2016). Butler and Sinha (2012, p. 168) stated such forward modelling is useful for interpreting data. McInerney et al. (2007) compared gravity data computed for a 3D geological model with new Bouguer data to iteratively improve their geological model, calling this forward modelling. Comparable usage occurs in biology (e.g., Tolwinski-Ward 2012). In such studies, inverse models have been honed with new data for sundry variables, producing improved inverse models (cf., iterative forward modelling, Schlumberger Limited 2016). However, such "forward modelling", albeit useful, is wholly different from testing genetic models with new variables prescribed by those models. Different distinctive terminology would prevent confusion.
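The geophysical usage described above (compute the response of an assumed model, compare it with observations, adjust the model) can be illustrated with a minimal sketch. It assumes the simplest possible source, a buried homogeneous sphere, whose vertical gravity anomaly at horizontal offset x from a centre at depth z is G·Δm·z/(x² + z²)^(3/2). The sphere, its radius, density contrast, station positions, and the "observed" values are all hypothetical; none are taken from McInerney et al. (2007) or the other studies cited.

```python
import math

G = 6.674e-11  # gravitational constant, m^3 kg^-1 s^-2


def sphere_gz_mgal(x, depth, radius, drho):
    """Vertical gravity anomaly (mGal) at horizontal offset x (m) from a
    buried sphere with centre depth (m), radius (m), density contrast (kg/m^3)."""
    excess_mass = (4.0 / 3.0) * math.pi * radius ** 3 * drho
    gz = G * excess_mass * depth / (x * x + depth * depth) ** 1.5  # m/s^2
    return gz * 1e5  # 1 m/s^2 = 1e5 mGal


def rms_misfit(observed, xs, depth, radius, drho):
    """Root-mean-square difference between observed and computed anomalies."""
    resid = [o - sphere_gz_mgal(x, depth, radius, drho)
             for o, x in zip(observed, xs)]
    return math.sqrt(sum(r * r for r in resid) / len(resid))


# Hypothetical survey line and "observed" data (synthetic, from a 1000 m deep sphere).
xs = [-2000.0, -1000.0, 0.0, 1000.0, 2000.0]
observed = [sphere_gz_mgal(x, 1000.0, 400.0, 300.0) for x in xs]

# Iterative improvement reduced to its core: keep the trial depth with least misfit.
trial_depths = [600.0, 800.0, 1000.0, 1200.0]
best_depth = min(trial_depths,
                 key=lambda d: rms_misfit(observed, xs, d, 400.0, 300.0))
```

Note that every step here adjusts a descriptive model to fit measured data; nothing in the loop tests a genetic hypothesis with variables prescribed by that hypothesis, which is the distinction drawn in the text.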

Vistelius' forward-model definition is retained in this paper.

### **37.5 Inverse Models in Earth Sciences**

Inverse models reach into many earth-science domains. Manual contours for variability of Donegal granite modes (Whitten 1957) represented an inverse-problem approach; more-sophisticated inverse models followed as computing power facilitated trend-surface map preparation (e.g., Whitten 1960). Computing power soon resulted in every available data set being processed by every available statistical artifice, to explore whether anything interesting (and publishable) emerged. Such research provoked Vistelius' strident remarks at the IAMG founding meeting.

Inverse problems fall into two categories:

- (i) useful features (e.g., gold content and location; subsurface sedimentary rock permeability variation), as with kriging and so-called 'geostatistics' (cf., Krige 1964; David 1977; Journel and Huijbregts 1978), or flooding or other risks (e.g., Burke et al. 2016), or
- (ii) petrogenetic processes (e.g., infra-crustal origins of *I*- and *S*-type granites within orogenic belts; White and Chappell 1983; Chappell 1984; Chappell and Stephens 1988).

<sup>1</sup> Numerous papers by Vistelius and coworkers used the important and challenging discovery that grain transitions along linear traverses of many granitic rocks possess the Markov property, to suggest testing or erecting genetic crystallization models can be based on grain-transition probabilities. However, Whitten and Dacey (1975) and Whitten et al. (1975) demonstrated that the presence of Markov chains in actual mineral sequences in varied rocks (including a calc-silicate granulite) is insufficient for establishing validity of the granite crystallization model.

Speculation about petrogenetic processes that produced described rock assemblages has always been common. Over a thousand high-quality chemical analyses of major and many trace elements for southeast Australian granites led to partitioning samples into *I*-type or *S*-type granitoids with dissimilar sub-crustal origins, and to the restite genetic model (e.g., Chappell et al. 1988, 1987; Chappell and Stephens 1988). Analogous methods were used elsewhere (e.g., North American Peninsular Ranges, Silver and Chappell 1987). Such inverse models could afford excellent forward-modelling bases, if they prescribed new variables with which to support or negate the supposed genetic model/s.

However, such inverse models are fraught with difficulties (Whitten 1991, p. 121). Use of different variable sets from Chappell and colleagues' chemical analyses can partition samples into an almost infinite set of descriptive suites. It is unrealistic to enunciate genetic scenarios for one set of descriptive suites, without concomitantly embracing all other coexisting sets defined by using different variables, sets of variables, variable weightings, etc. (Whitten et al. 1987a, p. 341; 1987b). Again, if techniques like cluster analysis were used to partition hundreds of samples on the basis of 36 chemical variables, normalization (to give each variable equal weight) would commonly be used, despite no a priori reason for each element being equally important. Different clusters emerge if one (or more) variable receives different weighting, and when more or fewer variables are included (Whitten et al. 1987b, p. 69; Whitten 1991, p. 121). Also, standard cluster analysis (and similar partitioning techniques) yields questionable results when percentage and/or parts-per-million data are used (cf., Aitchison 1986, p. 300).
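The sensitivity of a partition to variable weighting can be shown with a minimal sketch. It assumes four hypothetical samples described by two variables (labelled here as SiO2 in wt% and Rb in ppm purely for illustration), standardises each variable to zero mean and unit variance, multiplies by a chosen weight, and applies single-linkage agglomerative clustering until two groups remain. The data, the weights, and the choice of single linkage are assumptions made for this example, not the procedure of any cited study.

```python
import math
from itertools import combinations


def zscores(column):
    """Standardise one variable to zero mean and unit (population) variance."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / sd for v in column]


def two_clusters(samples, weights):
    """Single-linkage clustering of standardised, weighted data into two groups."""
    names = list(samples)
    columns = list(zip(*(samples[n] for n in names)))
    scaled = [[w * z for z in zscores(col)] for col, w in zip(columns, weights)]
    points = {n: [col[i] for col in scaled] for i, n in enumerate(names)}

    def linkage(a, b):
        # Single linkage: distance between the closest members of two clusters.
        return min(math.dist(points[p], points[q]) for p in a for q in b)

    clusters = [frozenset([n]) for n in names]
    while len(clusters) > 2:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return set(clusters)


# Hypothetical samples: (SiO2 wt%, Rb ppm)
samples = {"A": (70.0, 100.0), "B": (71.0, 180.0),
           "C": (75.0, 105.0), "D": (76.0, 185.0)}

equal_weight = two_clusters(samples, weights=(1.0, 1.0))
sio2_doubled = two_clusters(samples, weights=(2.0, 1.0))
```

With equal weights the samples pair as {A, C} and {B, D}; doubling the weight of the first variable regroups them as {A, B} and {C, D}. Neither partition is more "correct" than the other, which is the difficulty raised above for genetic interpretation of a single set of suites.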

However, where components are conserved throughout crystallisation within certain basic igneous rocks, molar ratios with a common constant denominator were shown to display, accurately and unequivocally, the actual chemical variability (e.g., Nicholls 1988; Stanley and Russell 1989). Molar-ratio diagrams for some Australian *I*- and *S*-suites seem to show chemical variations accurately, permitting quantitative objective testing of, say, the restite model (Whitten 1996). This technique for avoiding daunting closed-data problems deserves further examination, although, for many granites, lack of component conservation during crystallization may introduce difficulties.
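The closure problem and the conserved-denominator remedy can be illustrated with a small worked example. It assumes a hypothetical three-component system in which only component A is removed (for instance by crystal fractionation) while B and C are conserved; the component names and molar amounts are invented for illustration.

```python
def percentages(amounts):
    """Close absolute amounts to a constant sum of 100."""
    total = sum(amounts.values())
    return {k: 100.0 * v / total for k, v in amounts.items()}


def ratios_to(amounts, denominator):
    """Ratio of every other component to a chosen (conserved) denominator."""
    return {k: v / amounts[denominator]
            for k, v in amounts.items() if k != denominator}


# Hypothetical molar amounts before and after removal of 20 units of A.
before = {"A": 50.0, "B": 30.0, "C": 20.0}
after = {"A": 30.0, "B": 30.0, "C": 20.0}

pct_before = percentages(before)   # A 50.0, B 30.0, C 20.0
pct_after = percentages(after)     # A 37.5, B 37.5, C 25.0

# Ratios can be formed from the closed percentages: the closure constant cancels.
ratio_before = ratios_to(pct_before, "C")   # A/C 2.5, B/C 1.5
ratio_after = ratios_to(pct_after, "C")     # A/C 1.5, B/C 1.5
```

In percentage terms B and C appear to increase although nothing was added; as ratios to the conserved component C, B is unchanged and only A falls, matching what actually happened. The approach depends on C truly being conserved, which is the caveat noted above for many granites.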

### **37.6 The Samples Analysed**

Statistical or mathematical analyses of available data are the relatively easy part. Statistical manipulation (inverse modelling) describes characteristics and variation of particular data, but *not* necessarily characteristics and variation of those variables in the rock samples from which the data were derived (or necessarily of variables of *petrogenetic* significance for forward modelling, or of direct *economic* importance).

Data come from samples (or geophysically-sampled rocks, etc.). It is important to assess how well available samples represent the *sampled population* of interest, and whether that sampled population permits realistic extrapolation to the *target population* of primary interest (cf. Whitten 1961). For example, where the objective is determining compositional variation of a pluton, the exposed surface is an arbitrary 2D section (or modestly 3D in mountainous terrain) through the original 3D mass, much of which is eroded away. Soil, vegetation, etc. always obscure major parts of 2D exposures; actual outcrops are disposed arbitrarily or preferentially, but not randomly. Analyses of those samples actually examined (samples collected from sampled outcrops) are necessarily used to estimate composition and variability of the sampled population, and subsequently the target population.

The significance of actual observed dependent data was reviewed by Whitten (2000, pp. 4 *et seq*.) who asserted that, in favourable circumstances, rigorous statistical inferences can be drawn about the sampled population on the basis of samples examined, and subsequently geologists can only use such inferences to make *subject-matter inferences* about the target population on the basis of previous geological experience (cf., Cochran et al. 1954, p. 19).

Unusually, such issues can be obvious. For example, road cuttings might expose significantly banded or layered rocks, but only some of those bands may be exposed in outcrops across neighbouring areas.

Serial thin sections from coarse-grained granite samples commonly yield modal values with considerable variance. Exposed igneous rocks may be porphyritic, making collectable, representative samples difficult to obtain. Commonly, samples of dissimilar size are required to estimate composition and variability of each variable. For variables measurable only by laboratory analyses (e.g., modal zircon percentage, trace-element weight percentages), an adequate sampling plan can be devised only after estimating the level of variance of each variable from analytical results. The classical example is Krumbein and Slack's (1956) determination that variance of their variable of interest within a black shale over many square kilometres of Illinois, USA, is greatest at their smallest level of sampling (thin-section level). Different rock types require dissimilar strategies (e.g., determining calcite volume percentage throughout a cratonic limestone requires a less-dense sampling plan than, say, assaying gold weight percentage within subsurface Witwatersrand conglomerates or apatite volume percentage in a granite).

For Rattlesnake Mountain Pluton, California (USA), Baird and Welday (1967) showed that, when variance of attributes is large at their smallest sampling level (hand-specimen level), adjacent samples yield dissimilar values and thus dissimilar areal-variability maps. For their monumental studies of Lachlan fold belt granitoids, Australia, Chappell and colleagues powdered very large samples (over a kilogram) from the mainly visually-homogeneous outcrops, with the intention of minimising major and trace-element variance at the sample level (e.g., White et al. 1977; Chappell 1978). Their sample size and reproducibility of their chemical analyses yielded reliable data. In many regions, they collected a sample from virtually every outcrop protruding through arid rolling pasture. Areas between widely scattered outcrops (sometimes a kilometre apart) were necessarily un-sampled and unknown; it is appropriate to question whether extant outcrops exist because composed of rocks less susceptible to weathering (compositionally dissimilar to the majority).

Generalising, each variable commonly has dissimilar variance in samples of a specified size. Variance tends to be large between small samples, especially when grain size is large, and, as sample size increases, variance between samples decreases to a minimum, before increasing again for extremely large samples (cf., Whitten 1968; 2000, p. 6).
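The first part of this generalisation, the fall in between-sample variance as sample size grows, can be reproduced with a minimal simulation. It assumes an idealised rock in which each counted grain is independently the mineral of interest with probability p, so that a "sample" is simply a count over n grains; p, the sample sizes, and the number of replicate samples are hypothetical. The sketch does not model the later increase for extremely large samples, which would require larger-scale compositional heterogeneity.

```python
import random
import statistics


def between_sample_variance(p, grains_per_sample, n_samples, rng):
    """Variance (in percent squared) of modal estimates across replicate samples."""
    estimates = []
    for _ in range(n_samples):
        hits = sum(1 for _ in range(grains_per_sample) if rng.random() < p)
        estimates.append(100.0 * hits / grains_per_sample)
    return statistics.pvariance(estimates)


rng = random.Random(0)  # fixed seed so the run is repeatable
p = 0.3                 # hypothetical true modal proportion (30 %)

# Binomial expectation for comparison: 100^2 * p * (1 - p) / n = 2100 / n
variances = {n: between_sample_variance(p, n, 500, rng)
             for n in (10, 100, 1000)}
```

Under these assumptions the variance scales roughly as 1/n, so a coarse-grained rock (few grains per thin section) needs a much larger sample than a fine-grained one to reach the same precision.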

Such issues have long been recognized in mining exploration. Moving-average methods, developed by Krige (e.g., 1964) for South African gold-bearing conglomerates, were extended and explicitly controlled (in what is known as 'geostatistics') by levels of variance of variable/s, as expressed by semi-variograms (e.g., David 1977; Journel and Huijbregts 1978); observed large outlier values are accommodated within the 'nugget' effect. 'Nugget' aptly reflects very sparse, larger gold particles within the conglomerates, which affect predicted profitability of subsequent mining; nuggets are represented only occasionally in actual samples and resulting assay values (Whitten 2010, p. 250).
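For readers unfamiliar with the semi-variogram, a minimal sketch of the standard empirical estimator for regularly spaced data along a line may help: γ(h) is half the mean squared difference between values h steps apart. The assay series below is hypothetical, with one isolated high value standing in for a nugget; fitting a model and reading off the nugget as the intercept at zero lag is not attempted here.

```python
def semivariogram(values, max_lag):
    """Empirical semi-variogram for equally spaced values along a traverse.

    gamma(h) = sum((z[i] - z[i+h])**2) / (2 * N(h)), N(h) = number of pairs.
    """
    gamma = {}
    for h in range(1, max_lag + 1):
        pairs = [(values[i], values[i + h]) for i in range(len(values) - h)]
        gamma[h] = sum((a - b) ** 2 for a, b in pairs) / (2 * len(pairs))
    return gamma


# Smooth trend: dissimilarity grows steadily with lag.
trend = semivariogram([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)

# Hypothetical assays with and without one isolated high value.
background = [2.0, 3.0, 2.0, 3.0, 3.0, 2.0, 3.0, 2.0]
with_nugget = [2.0, 3.0, 2.0, 40.0, 3.0, 2.0, 3.0, 2.0]
gamma_background = semivariogram(background, max_lag=3)
gamma_nugget = semivariogram(with_nugget, max_lag=3)
```

A single isolated high assay raises γ(h) sharply even at the shortest lag, which is why such values are treated as a discontinuity at the origin (the 'nugget') rather than as spatially continuous variation.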

It is commonly assumed that, provided sampling has been 'adequate', variables of interest follow standard frequency distributions (normal, lognormal, etc.). Many common statistical algorithms assume input data are normally distributed; frequently, packaged computer programs normalise input data automatically (often with unspecified algorithms) prior to effecting statistical analyses. However, different normalisation algorithms can produce dissimilar resulting analyses.

### **37.7 The Black Swan Effect**

Throughout the earth sciences, sporadic sample measurements are wholly dissimilar to those for the majority of samples. Not infrequently, analyses lying on the extreme wings of distribution curves (normal, lognormal, etc.), or beyond the tails, are discarded; although such analyses might be attributable to analytical error, many are likely to be real and *very* meaningful. In studying the influence of the improbable in the earth sciences, Whitten (2010) demonstrated that real, localised, anomalous data can reflect features of significant genetic and/or economic importance; the 'black swan' effect (cf., Taleb 2007). That is, such data can reflect important factors not previously considered in models and theories—factors that, after recognition, are likely to be found highly significant.

Throughout geological time, all manner of events occurred that appear to be wholly arbitrary with respect to formation of lithology, structure, palaeontology, etc., of rock units. Impact of a meteor with the Earth is a good example, because it can apparently affect substantially both current organic evolutionary patterns and ongoing physical processes (e.g., sedimentation). Consequently, some, but not necessarily all, dependent variables (with respect to space and time) might show anomalies reflected as outliers on a distribution curve (a nugget-like effect). Such phenomena reflect the operation of customary physico-chemical laws and the effects of irreducible elements of chance and indeterminism (Whitten 2010, pp. 250–1).

The traditional search for order and simplified description commonly deflects attention from important real black swans that require inclusion for realistic understanding of geological phenomena and natural hazards. Mandelbrot (1982) provided a beautiful introduction to fractal geometry in nature; more recently, fractal, chaos, and nonlinear approaches have helped expose basic characteristics of the physical world, whose fundamental significance throughout the earth sciences is rapidly becoming more clear. A report (Lovejoy et al. 2009) on 'geocomplexity' summarized the importance of nonlinear geophysical methods in elucidating rational bases for statistics and models of natural systems (including hazards), which previously were treated by ad hoc methods. That report reflected 15 authors' research ranging from earthquake dynamics, river-flood prediction, basalt columnar-joint formation, coastline topography, and meteorological cloud models to interaction of greenhouse gases and global warming. It concluded with a warning against (a) reliance on traditional state-of-the-art statistical techniques (and theories based on them) and (b) ignoring nonlinear methods which are often helpful for more-complete understanding of the natural world.

### **37.8 Concluding Thoughts**

Throughout most geological domains, the qualitative-to-quantitative revolution via mathematical geology over the past half century has been awesome, made possible by numerical models and readily available data for greatly increased numbers of variables; all facilitated by hugely increased computing power. Investigations extend to variables whose variance cannot be estimated by eye (e.g., isotope ratios; electrical resistivity). The research is manifest in both IAMG Journals and other new approaches (e.g., 3-D visual digital models and virtual presentation of rocks and geological formations, De Paor 2016). Cataloguing, classifying, description, and presentation are often the useful goals, especially for economic geologists (e.g., oil-field research; kriging and 'geostatistics').

Pragmatic review emphasises that many basic (but apparently unexciting) problems enumerated five decades ago (e.g., variance; sampling), critical in inverse models for correctly portraying rock formations (rather than merely assembling data obtained from the rocks), have continued to receive little attention (Whitten 2003).

Birth, maturity, and old age characterise phases of all human endeavour. The past 50 years witnessed birth of IAMG and spreading of its influence throughout the earth sciences using inverse methods, but only initial recognition of the compelling importance of modelling forward problems (in Vistelius' meaning). Inverse-problem studies will move into maturity as variance, sampling, and non-linear models underpin on-going research.

The challenging needs and goals of forward problems are reasonably obvious, but the complex issues involved have been addressed only occasionally (e.g., Vistelius and Romanova 1972; Maslov 2003). Commonly, forward problems will require non-linear process models (i.e., quantitative genetic models) that specify those variables required to test the hypothesis. The next 50 years await research towards that maturity in forward modelling. So-called forward models of recent geophysical studies must not obscure this challenge.

### **References**

Aitchison J (1986) The statistical analysis of compositional data. Chapman and Hall, London


Russian]. In: Vistelius AB (ed) Ideal granites – issue I. Acad Sci, Nauka Press, Leningrad Lab Math Geol, pp 4–47



### **Chapter 38 From Individual Personal Contacts 1962–1968 to My 50 Years of Service**

**Václav Němec**

**Abstract** The author's initial personal random contacts with pioneers in introducing mathematics and computers to geology in Russia, USA and France evolved thanks to the 23rd International Geological Congress and the foundation of the IAMG in Prague 1968. An incredibly large set of colleagues from all over the world have continuously contributed to a long series of regular international sessions at the Mining Příbram Symposia—a unique East–West gateway for the IAMG during the period 1968–1989. Very intensive work has been continuing until 2000 with several new peaks. The author has used many positive international organizational experiences from the work for the IAMG in developing geoethics, where many experts of mathematical geology have brought a considerable contribution to this new field.

**Keywords** Mathematical geology ⋅ IAMG history ⋅ East–West contacts ⋅ Mining Příbram Symposia ⋅ Geoethics

### **38.1 Introduction**

My way into geology did not follow an easy direct path. In 1951 my studies of economics (including courses of mathematics and statistics) were stopped: for political reasons I was not admitted to the final 4th year. Instead of my studies I spent the following 26 months in special army units for politically unreliable persons working in the coal mines of the Ostrava region. At the end of 1953 I started to work in a state enterprise for geology of industrial minerals. At that time, this was a Cinderella among the other sectors of uranium, coal or metals deposits. My chief appointed me as an assistant to two associate professors of the Charles University who were engaged by our enterprise because of a lack of our own graduated geologists. Both these men later became known as very famous professors: Zdeněk Pouba in economic geology and Zdeněk Špinar in palaeontology. I remained in friendly contact with both of them for the rest of their lives. In 1954 I was able to start distance university studies in applied geophysics (in order to study geology there was a condition of having several years of practice, whereas for my specialization it was only necessary to have finished military service; such was the life in those days). Mathematics was among the key disciplines of my studies and my regular work in my enterprise became more and more focussed on evaluating the results of geological projects concerned with computing ore reserves. I graduated in 1959.

Founding and life IAMG member, Eastern Treasurer (1968–1980, 1984–1996), Krumbein medallist 1991, Prague, Czech Republic.

V. Němec (✉)

K rybníčkům 17, 100 00 Praha 10 - Strašnice, Czech Republic e-mail: nemec.geo@seznam.cz

At that time, our Cinderella was incorporated into a new enterprise covering exploration of all sorts of deposits except uranium. Despite some renewed political problems in 1960, I was appointed as chief of a special division for controlling the final reports of the company, being the only trained specialist when two of my new bosses arriving from other sectors preferred employment outside our company. On my own initiative I treated my job as a consulting service, discussing with my colleagues responsible for individual projects the appropriate methods for computing the reserves. Already in 1961, we started to mechanize the work using punch cards. During a tourist trip to the USSR in 1962 I had my first occasional contacts with several colleagues in Moscow at the State Commission of Ore Reserves, where I. D. Kogan was one of the top personalities (his son Robert later became my close friend). After a new reorganization in late 1962 I got a position in which it was possible to realize, along with trained computer specialists, new ways of applying computers for our specific professional needs.

In 1964—during my first trip behind the Iron Curtain after the Prague coup in February 1948—on a private family visit in the USA I had the chance to contact several colleagues in Colorado and Arizona working in the field of mathematical geology. The existence of the Tucson centre active in this field was discovered from literature by my colleague—economist and statistician *Blahomil Soukup*. My contacts with the organizers of the APCOM Symposia at that time held in Tucson and other US universities resulted in further interesting contacts. At the Colorado School of Mines *R. F. Hewlett* gave me the address of *Ivan P. Sharapov*. The following year (1965) this Russian scientist took a more than 2000 km long flight from Perm to Sochi in order to meet me in person for one weekend during my vacation in that famous Black Sea resort. Ivan was a man who despite incredible personal political problems (several years of arrest and concentration camps) continued to introduce mathematical statistics for applications in geology. He was extremely pleased to meet a colleague from abroad for the first time in his life in his 58th year. He had already established his own written contacts abroad and I obtained from him the addresses of such famous personalities as *Danie G. Krige* and *Georges Matheron*.

In 1965 I was among three Czech authors who published their papers at the APCOM Symposium in Tucson which in 1966 gave an impulse to *Dan Merriam* to contact us in the course of his visit to Europe including the Eastern territory (Krakow and Prague). Further progress in establishing new international contacts became extremely rapid and the approaching 23rd International Geological Congress in Prague (1968) brought me several special engagements among the organizers of the Congress as well as membership of the International Preparatory Committee (headed by *R. A. Reyment*) for the foundation of an international association for the application of mathematical methods and computers in geology (the exact name was under discussion).

In September 1967 during a private tourist trip to France I established personal contacts with Professor *Georges Matheron* and with several other French colleagues (*A. Carlier, Jean Serra*). In November 1967 I defended my doctoral thesis (RNDr.) at the Charles University in Prague in the field of economic and mathematical geology based on my first computerized model of three deposits for a cement factory in the suburbs of Prague.

In December 1967 I was the only foreign guest at the Second Siberian Symposium on Mathematical Methods in Geology and Geophysics in Novosibirsk (480 participants) where *Ivan Sharapov* and the local chief organizer *Yuri Voronin* helped me to contact many VIPs in this field from all parts of the USSR (including *Dmitry Rodionov*). When addressing the plenary meeting I invited people to attend the Prague Congress with a specialized session on mathematical geology and informed them about our plans to found a new international association. (*A. B. Vistelius* was the only member from the USSR on the international committee but he did not attend this Symposium).

### **38.2 IAMG Foundation (Prague 1968)**

In 1968 an incredible optimism characterized both the hopeful political development of the Prague Spring as well as the preparations of the International Geological Congress and of the founding meeting of the IAMG. I already had the pleasure to describe more details of these events in the book for the IAMG Silver Jubilee (Němec 1993a).

The euphoric start of the 23rd International Geological Congress gave me the opportunity to meet in person for the first time many new colleagues already well known in the field of mathematical geology (*Frits Agterberg, R. B. McCammon, J. W. Harbaugh, R. A. Reyment, A. B. Vistelius, G. S. Watson,* and *E. H. T. Whitten*). Professor *W. C. Krumbein* informed me that his arrival would be delayed. But very early in the morning of Wednesday August 21 all plans were changed by the entry of five armies under the Warsaw Treaty. Because it was impossible to visit the Congress centre I spent part of that day with Professor Reyment, who was staying in a hotel near my home. He made several telephone calls to the Swedish Embassy. It appeared that the current situation prevented any prediction about the future of the Congress.

On the morning of Thursday August 22nd, 1968 special transportation was set up again for Congress participants and some of the Congress program was re-activated. It became possible to use the room reserved for the preliminary discussions planned prior to founding the new Association. The new situation only permitted essential formal administrative steps including the election of the first IAMG Council. Professor *R. A. Reyment*, as the Chair of the meeting, declined the suggestion of *John Harbaugh* that he be elected President (preferring the position of Secretary General) and proposed *A. B. Vistelius* for this top position. Both key functions were unanimously approved. *G. S. Watson* was elected as the Vice-President representing a liaison with the International Statistical Institute. My suggestion to elect the absent *Prof. W. C. Krumbein* to the post of the "Past President" was accepted as well. *T. V. Loudon* was elected as the Western Treasurer. Prof. Watson suggested me for the post of the Eastern Treasurer. After my election I started my official activity for the new Association by suggesting *D. Krige* and *G. Matheron* (in their absence) as IAMG Council members. *F. P. Agterberg, D. A. Rodionov* and *E. H. T. Whitten* as well as the absent *S. C. Robinson* and *S. Sengupta* were elected as further members of the Council while *D. F. Merriam* and *Graham Lea* (absent) were chosen as the first editors of intended IAMG publications. The first IAMG Council had a very good geographical distribution. The election of two Russian scientists to the Council on that day testified to the absolute priority given to personal professional quality over any political concerns.

After a very emotional premature closing ceremony of the Congress on the afternoon of Friday August 23 I had the honour to represent the IAMG together with *A. B. Vistelius* and *E. H. T. Whitten* at a working meeting of the International Union of Geological Sciences where, in an accelerated process, our Association was officially approved as a new affiliated member. At that time I had no idea how many opportunities to work in the IAMG awaited me in the years ahead, including my service as the Eastern Treasurer for six terms altogether (1968–1980 and 1984–1996)!

### **38.3 Activities for the IAMG 1968–1993**

Various activities of the new Association had to be negotiated, mostly using normal mail. Today it is already difficult to imagine the modest technical means of that time (without any fax or e-mail). However, some personal contacts helped me to make a start with my duties. At that time, my employer—the geo-exploration state enterprise under a new name of *Geoindustria* became the sole collective IAMG member in Czechoslovakia supporting my official activities abroad by financing a lot of my travel expenses.

In January 1969 I visited a conference of mining geodesy in Moscow and paid a visit to *A. B. Vistelius* in Leningrad. The possibility of visiting Western countries continued until the autumn of 1969 and I therefore had no problem meeting many IAMG Council members at the Congress of the International Statistical Institute in London in August 1969, spending three weeks in September in France attending a special course on geostatistics at Fontainebleau, and accepting, with the consent of my employer, the invitation of the Kansas Geological Survey in Lawrence (initiated by *Dan Merriam*) to work there from November 1969 until August 1970. This was an excellent opportunity for establishing many further (already global) useful contacts for my activities for the IAMG and for the international development of mathematical geology. I hold deep memories of my experiences from that time (Bonham-Carter et al. 2008), especially the colloquium on Geostatistics (Nemec 1970) held on the campus in Lawrence and the APCOM Symposium in Montreal (both in June 1970).

In addition to my stay in America I also had to work hard to fulfil my professional duties for Geoindustria. The following text will hopefully disclose how useful working at this cosmic speed during this starting period turned out to be for all the hyper-activities carried out during the remaining almost five decades of my further life.

### **38.4 Příbram—East–West Gate Near the Iron Curtain**

As explained elsewhere (Němec 1993b), a symposium "The Mining Příbram in Science and Technique" was organised for the first time in 1962. The city of Příbram, located 60 km SW of Prague, had a long mining tradition going back to the thirteenth century. In November 1968 several Czech colleagues, mostly geophysicists from the Czechoslovakian Uranium Industry, organised a special session on *Mathematical Methods in Geology and Geophysics* for the first time. They also agreed to organise a special seminar on Geostatistics in Prague and I had the honour, in the course of my visit to France in September 1968, to invite *G. Matheron* and *J. Serra* to take part in that two-day seminar as well as in the new session in Příbram. Both guests were deeply impressed by the Czech audience and hospitality, and Prof. Matheron himself suggested continuing the Příbram meetings with co-sponsorship of the IAMG. I immediately started to promote that idea.

From 1969 I acted as the main convenor of that specialised international session, which actually came about as early as October 1969. We had guests from six countries, but it seemed impossible for *A. B. Vistelius* or *I. P. Sharapov* to attend the meeting (they sent in their written articles). Shortly after the meeting I left Prague to start my temporary work in Kansas. Through contact with the secretariat of the Symposium and with several Czech colleagues (*B. Soukup, M. Škubal*) it was possible for me to continue on from Lawrence with preparations for the next session at Příbram in 1970. Using my new contacts, I was able to successfully promote the idea of also holding these rendezvous at the above-mentioned meetings in Lawrence and Montreal. My work in Kansas terminated in August 1970 and in October there were already 26 foreign colleagues from 11 countries who participated in the Příbram session, together with about 55 participants from Czechoslovakia. We had several guests from America (*Michel David, Dan Merriam,* and *Tim Whitten*), one from India, and also *Dmitry Rodionov* appeared from Russia. Simultaneous translation was used for the first time. This was a very good start for further promotion of this kind of meeting which later took place regularly in October every year until 1973. The 1970 Příbram meeting can be classified as an important milestone of progress.

Since about 1965 the promotion of mathematical methods and computers in the Earth sciences had been included in the official activities of the Eastern bloc organization *COMECON (Council for Mutual Economic Assistance)*, and in 1973 a regular meeting of specialists was planned and organized in Czechoslovakia. Many participants of previous regular meetings on this subject already knew Příbram. It became possible to find a way to combine the official meeting for COMECON delegates (it took place in a locality not far from Příbram) with the regular Symposium (all scientific papers were presented in Příbram).

This arrangement made it possible to intensify the already existing East–West contacts. After 1973 the section on Mathematical Methods in Geology was regularly organized every second year—in 1983 again in conjunction with a special COMECON meeting. Many IAMG members from both the West and East were taking regular part in the meetings, e.g. *Tim Whitten* visited Příbram as IAMG Secretary General in 1977 and again as IAMG President in 1983. Also, representatives of COGEODATA were among the visitors and thanks to the initiative of *Jiří Hruška* on several occasions official meetings of that organization were arranged in Prague making it possible for their participants to also take part in the Mining Příbram Symposium. In 1989 and 1991 specific problems of geoinformatics were included in a separate parallel section of the Symposium.

Regular meetings of the specialized COMECON groups were organized in different COMECON countries according to their usual format which involved excluding visitors from other countries. However, both their meetings at Příbram in 1973 and 1983 were unique exceptions lifting scientific programs to a level accessible to all scientists from around the world. I was very lucky that this idea was adopted not only by top representatives of the Czechoslovak geological community but also by the representatives of the COMECON Secretariat in Moscow and by the authorities responsible for that sector especially in the USSR, Hungary, Poland and Yugoslavia.

From 1983 onwards the meetings of Příbram were regularly attended by participants of special courses on geochemistry organized regularly in Czechoslovakia by UNESCO with the School of Mines at Ostrava. At that time, I also had some written contact with UNESCO top representatives (see Fig. 38.1).

In 1987 the section was organized jointly with the Geochautauqua, held for the first time outside North America (unfortunately, without visitors from that part of the world).

**Fig. 38.1** Letter of the UNESCO Deputy Director General A. Kaddoura to Vaclav Nemec. The French text is a warm expression of thanks for the golden medal of the Mining Příbram Symposium appreciating regular co-operation of the international section of mathematical geology with UNESCO courses on geochemistry organized at the School of Mines in Ostrava

The rapidly changing political situation in the Eastern bloc permitted in October 1989 (6 weeks prior to the November *velvet revolution*) the visit to the geo-mathematical section at Příbram Symposium of many people from the East (especially about 65 guests from the USSR). Altogether 125 visitors from 23 foreign countries (both East and West) with also about 125 colleagues from Czechoslovakia represented a new record of participation.

In 1991 the section was already organized in a new political and economic climate. Members of a new ad hoc *committee* of the IAMG appointed by the IAMG President *R. B. McCammon* for preparing the Silver Anniversary Meeting of the IAMG were present among the participants: *Dan Merriam, Frits Agterberg, Peter Dowd, Mike Hohn* (IAMG Secretary General), and *V. Němec*. Intensive talks were held in my home in Prague prior to the Symposium and everybody seemed to agree with my suggestion to prepare a joint Silver Anniversary Meeting of both IAMG and the Mining Příbram as a festive gathering of Western and Eastern colleagues in Prague following the format of the meetings of the Mining Příbram Symposium in 1993. The resulting information was communicated to all participants at Příbram.

At the IGC 1992 in Kyoto in my paper discussing the 15 geomathematical sessions held regularly at the Mining Příbram Symposia from 1968 until 1991 I had the pleasure to present the following impressive results:


Only 45% of the published full texts or abstracts were presented orally, because not every author was allowed to come to Příbram. The State authorities, especially in the USSR and in Eastern Germany, were watching and controlling the situation, and more freedom for individual visitors only became evident in 1989, when the combination of political and economic circumstances had become optimal for travel to Příbram.

### **38.5 My Own Professional Work**

In 1972 I was asked by the Central Geological Institute in Prague for a peer review of a book prepared by the Czech authors *Vladimír Sattran* and *Blahomil Soukup* about the application of mathematical methods in geology. It was published in the Czech language in 580 copies (Sattran and Soukup 1973). A large list of publications from prominent authors, both Western and Eastern, represented a very good review and the whole book reflected the actual situation and some promising future development trends.

In my own work at Geoindustria in Prague I had the possibility to continue developing new space and time models for various deposits as well as arranging the agendas for the Mining Příbram symposia. My continuing position in the IAMG Council was accepted by top representatives not only from my employer but also of the Czechoslovak Bureau of Geology. I had the chance to visit at least partially all the International Geological Congresses since 1980 (see Fig. 38.2 from Moscow 1984), and International Stratigraphic Congresses in Heidelberg (1971) and in Nice (1975), APCOM Symposia in Clausthal (1975) and London (1983). Every year I was a regular guest at geomathematical meetings organized in Krakow by Professor *Janusz Kotlarczyk*, in Freiberg (Saxony—Eastern Germany as section of large events), and at many meetings in various parts of the USSR as well as several meetings in Hungary (*Istvan Dienes, Endre Dudich*).

**Fig. 38.2** Václav Němec attending a session on mathematical geology at the International Geological Congress in Moscow (1984). The neighbour of Václav Němec is the highly respected French expert in geomorphology and petrography André Cailleux

I spent one month on a lecturing tour in Italy (1971—see Fig. 38.3) and another lecturing tour in Canada and the USA (1986); also, several working visits in Vietnam and in Mongolia should be mentioned because of the possibilities of making some special contacts with local academic circles (Professor *Ochir Gerel* in Ulaan Baatar).

In all these meetings I was presenting my own (sometimes co-authored) scientific papers, mostly in the domain of space and time models for various kinds of deposits. I always emphasized that special attention should be given to achieving a geologically correct solution by avoiding inappropriate mathematical processes (interpolation) leading to erroneous geological interpretations. My speciality also covered so-called inserted subsystems (Němec 1988).

Every opportunity was used for spreading information about the IAMG and about the possibility of visiting Příbram as the only relatively easily accessible East–West meeting point. The success was partially achieved thanks to my ability to communicate in different local languages.

I also had the possibility to officially invite several specialists to give individual courses or lectures in Prague (*Frits Agterberg, Tim Whitten,* and *Jan Harff*).

**Fig. 38.3** Announcement of a presentation of Dr. Němec and of his following seminar in Italy. The Italian announcement signed by the Rector of the Polytechnic Institute in Torino Prof. Dr. Ing. Lelio Stragiotti informs about a special conference on the application of mathematics to problems of mineral deposits from the point of their exploration and mining exploitation followed by three days of seminars about the computerized evaluation of reserves of deposits. Seminars were reserved for teaching staff and students of the Institute but also accessible for specialists and members of the Sub Alps Mining Association (May 1971, all events were held in the Italian language)

In the early 1970s I was already a guest lecturer at the Charles University in Prague, then in the late 1980s at the Technical University of Košice and in 1991/92 at the Comenius University in Bratislava, providing special courses about applying mathematical methods and models in the Earth sciences (including mining processes).

In 1987 I defended my higher scientific degree C.Sc. (candidate in sciences as a Ph.D. equivalent) at the Technical University in Košice (in the mining sciences). The work, without any supervisor, was based on summarizing my development of space and time models for optimizing the long-term mining processes at various kinds of deposits.

### **38.6 Two Separate Silver Anniversary Meetings of Mathematical Geologists in Prague (1993)**

The idea to select Příbram 1993 for a broad international meeting in close co-operation with the IAMG had been discussed originally in 1986 during my trip to North America on the occasion of the Geochautauqua in Calgary and also when visiting *Dan Merriam* in Wichita. These talks continued in Washington DC at the International Geological Congress 1989, when the process of considerable change in the Eastern bloc was already starting. A few months later the velvet revolution in Czechoslovakia opened the door for fulfilling the idea in a more impressive way. The IAMG President *R. B. McCammon* in particular emphasised his vision of a broad historical meeting of colleagues from both the West and East. All my activities in that period were oriented toward this goal and all authorities responsible for the Mining Příbram Symposium also agreed with such a vision.

With the help of my wife, Lidmila Němcová, I arranged contacts with the Krystal centre in Prague, which served three main Prague universities and was administered by the University of Economics (where my wife was teaching). This centre seemed to be the optimal place for holding the Silver Anniversary Meeting (technical equipment, relatively low prices in comparison with other possible centres, hotel capacity, very good access from the airport as well as from the down-town area, good personal contacts with administrators). We had also found several other possibilities of accommodation (some of them in the neighbourhood of Krystal), at that time available for only about 10 US\$ per night. The members of the aforementioned ad hoc committee were able to verify the situation, as was the IAMG President *R. McCammon*, who paid a personal visit to Prague in November 1991. We also started to prepare a special "silver" medal for the Silver Anniversary meeting: *Antonín Ryčl*, secretary of the Příbram Symposium, introduced us to the famous Czech medallist *Lumír Šindelář* who after several discussions designed both marvellous sides of it. In April 1992 *John Davis* and *Jan Harff* visited Prague, and in addition to our intensive talks their stay included a visit to the artist. We all expressed strong enthusiasm for the design of the medal and only a few small corrections seemed to be necessary. *John Davis* prepared on his PC a Memorandum of Discussions and later I also received this document from *Dan Merriam* with an accompanying letter giving full approval to all the results achieved. It was possible to arrange for the final production of the medal in the mint house of Kremnica and to continue with the standard preparations for the Jubilee meeting.

In the meantime, I was also very pleased when the IAMG President *Dick McCammon* announced to me by phone that I had been elected W. C. Krumbein medallist for 1991. This highest IAMG award primarily reflected my long-term service to the profession in organizing and maintaining uninterrupted East–West contact between mathematical geologists through the gateway of Příbram.

Unfortunately, some misunderstandings arose: one of them was connected with the side of the medal commemorating the liaison of Prague and Příbram with the IAMG (use of some religious symbols from the Saint Hill—a famous pilgrim locality at the border of Příbram). At that time the renewal of religious freedom was highly appreciated in Czechoslovakia and in other countries of the Eastern bloc. However, the American colleagues entertained different points of view for the standards of international contacts. After my arrival at the IGC in Kyoto (1992) I was asked to arrange with the artist to replace that side of the medal just by the official IAMG logo. Another idea consisted in separating the IAMG Silver Jubilee from the same jubilee of our meetings at Příbram. In my role as the IAMG officer I continued my loyal service to the Association, arranging for contacts with the *Carolina* agency as needed (enabling preparations for the IAMG Silver Anniversary meeting in the Krystal centre). On the other hand, I also had to prepare the Silver Anniversary meeting for the international section of the Mining Příbram Symposium. The respective authorities approved the use of the Krystal centre for that purpose for the days following the IAMG meeting. All potential participants of the "Příbram" Symposium (about 400!) were informed in time by me about the IAMG meeting as well. A special advertisement was published in the Czechoslovak monthly geological magazine.

The final solution resulted in two separate Silver Anniversary meetings taking place in Prague at the same Krystal centre. The IAMG sessions were visited by 152 (mostly Western) people, the Příbram sessions by 140 (mostly Eastern) people. Only about 40 persons attended both meetings. Just one compromise had been finally reached: a common half-a-day meeting accessible to both IAMG and Příbram participants focussed on the history of mathematical geology.

In the end I think that the various misunderstandings and misconceptions connected with the IAMG Silver Anniversary Meeting in Prague also had some positive consequences: more freedom was given to all local organizers of subsequent annual IAMG conferences and the IAMG Councils in the years following until 1999 continued to provide some financial and moral support for the geomathematical sessions organized by the Mining Příbram Symposium.

### **38.7 From the Silver to the Golden IAMG Jubilee**

In 1994 I received a diploma of "engineer" from the University of Economics in Prague as restitution for the violation of my rights in 1951, when I was not permitted to complete my studies of economics in spite of good results.

I continued to organize the international meetings as part of the Mining Příbram Symposia in the years 1995, 1997 and 1999. These sessions were held again at the Krystal centre in Prague without any help from any official congress agency, and always with the moral and some financial support of the IAMG. *Mike Hohn*—the IAMG President—honoured the session in 1995 by his presence and was able to contact many Eastern participants. Financial support from the IAMG made it possible to pay local expenses and registration fees for about 15 foreign colleagues (for each session). We always had about 80–100 participants from abroad and the scientific level of presentations was good. The new economic situation in the Czech Republic led to decreasing participation from Czech colleagues who were represented by only a small minority.

Czech colleagues who helped me in my organization work until 1989 were not available anymore (being completely absorbed by other activities, retired or deceased). Western colleagues preferred to attend the official IAMG Annual Conferences. For some Eastern colleagues (especially from the countries of the former USSR) a new visa policy demanded lots of extra work for me as a volunteer organizer of the Příbram meetings. Therefore, I decided to stop further activities for the traditional session of "Mathematical Methods in Geology" organized 19 times between 1968 and 1999. I only revived this old tradition in 2011 on the occasion of the Mining Příbram Golden Jubilee Symposium, already reported in connection with my new field of interest in the following text of this article. Very positive remarks were published by Vera Pawlowsky in the Presidential Forum in the IAMG Newsletter (December 2011).

### **38.8 The IAMG Experiences Applied to Develop a New Discipline of Geoethics**

With the inspiration and support of my wife Lidmila Němcová (expert on business ethics) I have worked since 1991 to establish a new discipline in the family of earth sciences—*geoethics.* Originally, the main reason was focussed on ethical problems connected with the non-renewability of mineral resources.

The relatively good start of the new discipline and its rapid development became possible thanks to our extensive contacts established especially in the former Eastern bloc where many colleagues had first-hand knowledge of and personal experience with the Mining Příbram Symposia and with their traditional sessions on mathematical methods in geology.

It is beyond the scope of this contribution to describe the proper development of this new field of interest. On the other hand, I feel it my duty to express thanks to the IAMG representatives who supported these activities when the development was not yet covered by another association (AGID since 2004).

### **38.9 Conclusion**

I started my final preparation of this article during the days following the death of the famous IAMG promoter Professor *Dan Merriam* as well as at the time of his funeral service in Lawrence. I have never changed my very positive evaluation of him and of his merits for the IAMG as expressed in my Introduction to the *"Festschrift"* (Němec 1993a). I was deeply moved when reading in the official obituary about the Gold medal of the Mining Příbram Symposium 1970, which was listed first among his many other awards for his activities. His personality and his spirit will accompany the readers of this contribution on every page. It is impossible for me on this occasion to convey to anybody the many very happy, pleasant and unforgettable events connected with Dan and other old fellows I had the privilege to meet during my long service to the IAMG.

Let me emphasize my personal conviction that just a trans-generational solidarity is the "secret" explaining the otherwise unbelievable success of the half-a-century IAMG history. A recipe for the further 50 years of the IAMG: Enthusiasm of the young generation should be always accompanied by life experiences and the know-how of the old pioneers.

Vivat IAMG!

### **References**


Sattran V, Soukup B (1973) Použití matematických metod v geologii. ÚÚG Praha, 156 pp. In Czech

Numerous personal reports of V. Němec as well as reports of others about his activities can be found in the IAMG Newsletters at the IAMG website


### **Chapter 39 Andrey Borisovich VISTELIUS**

### **Stephen Henley**

**Abstract** This chapter provides a glimpse of the legacy of Professor Andrey Borisovich Vistelius, who served as the first President of the International Association for Mathematical Geosciences (IAMG) during 1968–1972.

Professor Andrey Borisovich Vistelius (1915–1995) was arguably the founder of the field of mathematical geology, and he was the first President of the International Association for Mathematical Geology. As a 1982 recipient of the President's Prize (later renamed the Andrey Borisovich Vistelius Research Award) I consider it a great privilege to have been invited to contribute this chapter in his honour. The scientific heritage of Professor Vistelius is extremely rich. His active work on fundamental and applied problems of geology, and especially mathematical geology, continued to the last days of his life. He was responsible for more than 200 published works, each representing a significant contribution to science. His works cover a wide range of subjects, with contributions to the development of stratigraphy, mineralogy, petrography, petrology and geochemistry. The mathematical approach to geoscientific research, pioneered by Vistelius, has gained recognition worldwide. As applied in practice, these works also represent building blocks to more effective methods of search for minerals. There have been a number of publications about Vistelius, and in attempting to present a rounded view of his life and works, this chapter quotes from them extensively: particularly Dvali et al. (1970), Romanova and Sarmanov (1970), Dech and Glebovitsky (2000), Merriam (2001), Henley (2003), Dech and Henley (2003), and Whitten (2004). I also wish to acknowledge unpublished sources including Whitten, the late Merriam, Pshenichny, and Dech.

S. Henley (✉)

Resources Computing International Limited, 185 Starkholmes Road, Matlock, Derbyshire DE4 5JA, UK e-mail: steve@vmine.net

### **39.1 Background**

Andrey Borisovich Vistelius was born on 7th December 1915 into the family of a Russian nobleman. His father Boris Vistelius was a lawyer in St. Petersburg before the October Revolution of 1917. Boris's father (Andrey Borisovich's grandfather) occupied a senior position in the civil service of the Russian Empire. The relatives of Andrey's mother (the Bogaevsky family) included some distinguished academics. Thus, his maternal grandfather was a professor at the Imperial St. Petersburg Institute of Technology, and his uncle was rector of the Imperial St. Petersburg Academy of Art.

There is no published information on Vistelius' early childhood and how he and his family fared during the turbulent years of revolution and civil war. However, it is known that in 1935, after the assassination of Sergei Kirov, the communist leader of Leningrad (as St. Petersburg was renamed in 1924), Boris Vistelius with his wife and son Andrey (at that time a student aged 20) were exiled from Leningrad like many other intellectuals and noblemen. First the Vistelius family found themselves in a remote village in middle Russia, though later the family was allowed to settle in the city of Samara. Because of this forced deportation, A. B. Vistelius had to interrupt his education at the Leningrad State University (which he had entered in 1933).

His studies were resumed only by good luck. Stalin issued an edict with the slogan "sons are not responsible for their fathers' deeds", and Boris Vistelius sent a letter to Stalin which clearly received a positive reply. This allowed Andrey Vistelius to resume his studies in Leningrad and in 1939 he graduated brilliantly from the Department of Mineralogy which was headed at that time by Prof. S. M. Kurbatov, a pupil of Academician V. I. Vernadsky, the great mineralogist and geochemist who is considered one of the founders of geochemistry, biogeochemistry, and radiogeology.

A. B. Vistelius was a vivid and gifted personality. He had a very extensive knowledge of history and literature (both Russian and foreign), appreciated poetry and read English authors in the original. But geology and mathematics were his overwhelming passions. The research topics he investigated were always of great practical importance and at the same time lent themselves to the innovative and elegantly developed solutions which became a hallmark of Vistelius' work.

He was very sensitive to any dishonesty in science—and especially to political lies. He was known as a sharp-tongued man among his colleagues. Especially under Stalin's rule, officials did not like such people, and it was very hard for Andrey Vistelius to further his career. His scientific honesty, frankness and his manner of open and explicit expression of his viewpoint prevented his elevation to Academician of the Academy of Sciences, the highest scientific institution of the USSR. For the political appointees who, as a rule, were heads of all scientific establishments, he was an irritant, indeed an extreme nonconformist.

Thus, he never denied his aristocratic heritage, at a time when most descendants of noblemen in Russia were trying to obscure their origins, some even changing their surnames during the period of communist rule. In curricula vitae for job applications he repeatedly wrote that he was a nobleman by birth. Of course, copies of all these documents were compulsorily held by the KGB (Committee for State Security of the USSR), and his noble descent was an embarrassment for the scientific authorities, his employers.

During World War II, A. B. Vistelius was trapped in besieged Leningrad and endured all the sufferings of its inhabitants. He was not enlisted into the army because of poor eyesight. Despite the war, his studies continued: he was awarded his 'Candidacy' (roughly equivalent to a western Ph.D.) in 1941, and subsequently his Doctor of Science degree in 1948. After working as a senior scientist in several state organisations, and serving as a director of several geological 'expeditions' (the organisations in the USSR, and later the Russian Federation, responsible for regional geological mapping), he became the director of the newly created Laboratory of Mathematical Geology at the Steklov Mathematical Institute of the USSR Academy of Sciences in Leningrad.

In 1968, Vistelius was instrumental, with others, in founding the International Association for Mathematical Geology, and was elected its first president.

Although his circumstances meant that he was unable to participate in many of IAMG's activities, he continued work as a prolific researcher in Leningrad (subsequently St. Petersburg) with extensive publications in both English and Russian. Whitten (pers. comm.), during a visit to Leningrad in 1971, invited him to Northwestern University (Illinois), an invitation which Andrey Vistelius was finally able to accept for the Spring Quarter 1975. His publication list reflects the results of research projects which he was able to undertake in the US during his time there.

He continued to work in St. Petersburg during the 1970s and 1980s, with a steady stream of research publications, in Russian and in English.

Professor Andrey Borisovich Vistelius died on 12 September, 1995. He continued to work until his last days, with lucidity and inventiveness of thought even in spite of serious illness. In 1992, not long before his death, Kluwer Academic Publishers printed an English translation of his life's work "Principles of Mathematical Geology" (Vistelius 1992). This is a considerably reworked and enlarged English edition of his Russian monograph with the same title (Vistelius 1980).

### **39.2 Scientific Achievements and Insights**

The scientific heritage of Prof. A. B. Vistelius is extremely rich. His active work on both fundamental and applied geology, and especially mathematical geology, continued to his last days. He was responsible for more than 200 published works, each of them presenting a very significant contribution to science. References to many of these are supplied below.

Reflecting the breadth of his knowledge and fields of interest, his works cover a wide range of subjects, dealing with research in the fields of stratigraphy, mineralogy, petrography, petrology and geochemistry. The application of mathematical methods, pioneered by Prof. Vistelius, has gained recognition worldwide. As applied in practice, these works represent a building block to more effective methods of search for minerals.

From his earliest post-graduate studies, Vistelius carved out a career which defined a whole new branch of science—mathematical geology.

The ideas of this newly created field of science were first vigorously supported by Academician Vernadsky and then by Academician Kolmogorov. The high value and prospects of Prof. Vistelius's ideas were emphasized in a review of his works, published by Nature, the international science journal, in 1947. Nevertheless, the ideological regime that reigned in the USSR forced mathematical geology to follow a most difficult path. At that time the Ideological Department of the Central Committee of the Communist Party of the USSR was concerned with purging various branches of science in any way connected with cybernetics, genetics and other newly developed fields which they proclaimed as contradicting Marxist-Leninist ideas. It is sufficient to remember the ill-starred session of the Academy of Agriculture of the USSR in 1948, with Academician Lysenko in the chair, whose actions contributed to the tragic death of Academician Vavilov, a botanist and geneticist of international fame.

For minds narrowed by ideology, mathematical geology was nothing but another suspicious field close to cybernetics. Prof. Vistelius and his group could not avoid this political minefield. Scientific life in the country was totally governed by communist administrators who, on the one hand, did not understand the ideas of Vistelius and sought to deny him the opportunity to work, and on the other hand wished to please higher party authorities. Prof. Vistelius with his unusual mathematical ideas appeared an ideal target. Fortunately, the ideological attacks on him were not strong enough, and he defended himself fiercely. This is why the ideological persecution did not bring tragic results. Nevertheless, the damage to his scientific career was considerable. He had to leave the All-Union Oil Geology Research Institute (VNIGRI, Leningrad) where he had been developing the concept of phase differentiation of Paleozoic sedimentary carbonate rocks based on the theory of random functions (a concept that he nevertheless defended brilliantly in the same year, 1948, as his dissertation for the degree of Doctor of Science).

It is noteworthy that the academic summary "Introduction into the theory of random stationary processes" (the basis for studying phase differentiation of sedimentary carbonate rock), well-known today to mathematicians and specialists in applied science, was first presented only in 1952 by mathematician A. M. Yaglom. This shows that geological phenomena can become a principal material for creation and development of formal mathematical schemes also, as was repeatedly stated by Vistelius. At that period he closely collaborated with the distinguished mathematician, Academician A. N. Kolmogorov, and worked with him on a very important problem of sedimentology relating to the formation of sedimentary strata. As a result, Kolmogorov wrote a paper "Solution of one problem of the theory of probability, related to the problem of mechanism of bed formation" published in "Doklady AN SSSR" (Kolmogorov 1949). The methods of solving this problem were further discussed by M. F. Dacey in his paper "Models of bed formation" (Dacey 1979). There are other examples of such development of formal mathematical structures, for instance, mathematical investigations developing the formalisms of finite Markov chains and processes along with their geological applications, by mathematicians B. P. Harlamov and A. V. Faas in close collaboration with Vistelius.

In 1952 Prof. A. B. Vistelius was invited to join the Laboratory of Airborne Methods of the Academy of Sciences of the USSR (AS USSR). There, with the support of N. G. Kell, the director of the laboratory and a Corresponding Member of the Academy, he organized a group to carry out investigations not just in the field of airborne methods, but mainly in the field of mathematical geology. At this time (before 1960) his group researched several approaches to the problem of comparison of geological sections and reconstruction of the processes of bed formation using the theory of random processes. A. B. Vistelius was actively involved in development of methods of statistical evaluation and examination of hypotheses able to provide the necessary validity for comparison of a model with geological observations.

Despite the obvious importance of the results of Vistelius' work, and the support given by Academicians Kolmogorov, Korzhinsky, Belyankin, Linnik and later Artsimovich, the academic Department for Geology and Geography was too closely connected with the Ideological Department of the Central Committee of the Communist Party and impeded the development of mathematical geology whenever possible. In response, in 1961 the mathematical academicians transferred the group headed by Prof. Vistelius to the Leningrad Branch of the Steklov Institute of Mathematics (LOMI) of the USSR Academy of Sciences. The branch was headed by Prof. Petroshen, a well-known mathematician who specialized in seismic fields, and who encouraged the work of Vistelius' group. There it was set up formally as the Laboratory of Mathematical Geology. It is noteworthy that such a decision was an indication of the fact that the structure of the Academy of Sciences was like "a state within a state". Sometimes it was able to take actions which ran counter to the wishes of the Central Committee of the Communist Party.

The Academy of Sciences was precisely the right environment for initiating thorough field investigation, allowing disinterested scientific research, to develop the fundamental principles of mathematical geology. A. B. Vistelius, with broad experience in different fields of geology, developed ideas for the introduction of mathematics into geology systematically and with clarity of purpose.

By the end of the 1970s he demonstrated the advantages of using the methods of mathematical geology that he had developed to a range of questions in mineralogy, petrography, lithology, petrology and more general problems of regional geology in the fields of paleogeography, lithostratigraphy, and geochemistry. The results of his studies showed that mathematical methods were not to be confined to summarisation of geological information, or to identification of geological events and phenomena on the basis of numerical calculations, but could provide a means of expressing geological concepts in mathematical language. The line of inquiry that was defended by A. B. Vistelius and determined by that time as "mathematical geology" leads geology to a higher level, demanding more concrete and accurate notions about objects or processes under consideration than is possible without the application of mathematics.

His group's scientific work in LOMI, an outstanding internationally recognised mathematical research centre, however, entailed some specific problems. The mere principles of solving tasks of mathematical geology did not raise any objection in the institute, but the choice of propositions for each geological mathematical model remained hard to understand for mathematicians, including the hierarchy of the institute. The institute's administration consisted of theoretical mathematicians who needed only a sheet of paper and a pen for their work. It was hard to persuade them that geology needs field work and an experimental basis to obtain the data necessary to construct and verify models.

This is why Prof. Vistelius had to look for another more suitable host organisation for the Laboratory of Mathematical Geology. This difficulty, as well as the importance of mathematical geology, were met with understanding by A. P. Aleksandrov, the President of USSR Academy of Sciences, in 1986, and in the following year he moved the Laboratory of Mathematical Geology from the Department of Mathematics to the Department of Geology, Geochemistry, Geophysics and Mining of the Academy by attaching it to the Institute of Precambrian Geology and Geochronology (IGGD, AS USSR).

Then, however, it became immediately apparent that a traditional geologist and a mathematical geologist spoke different languages and the majority of geologists did not understand the mathematical approach to modelling geological phenomena despite the fact that mathematical geology had existed for more than forty years.

It seemed that transformation of the Laboratory of Mathematical Geology into an institute was overdue. The necessity of such a decision was repeatedly stressed by a number of senior scientists such as Academicians Sokolov and Laverov (who was an acting Vice-President of the Russian Academy of Sciences). But this idea was achieved only in 1991 when the Russian Academy of Natural Sciences (RANS) was founded. Prof. Andrey Vistelius was named an Honorary Member of this Academy at the first elections and charged with organization of an Institute of Mathematical Geology.

Vistelius' Laboratory of Mathematical Geology together with the Laboratory of Petrophysics and Mathematical Geology of the Earth's Crust Institute of St. Petersburg State University, constituted the basis of the institute. However, RANS is not a government institution and it had no support from the federal budget. For this reason RANS could not supply the Institute of Mathematical Geology with appropriate financing. The Ministry of Science and Technology of the Russian Federation agreed to subsidize the institute after difficult negotiations. The institute, for its part, took on large obligations in solving some practical geological problems by means of mathematical geology.

Dech and Glebovitsky (2000) give a detailed account of the many fields in which the work of Vistelius advanced geological knowledge through his deep understanding of underlying geological processes and innovative application of mathematical methods.

To understand fully Vistelius' immense contribution to the geosciences, it is necessary first to identify the different and complementary approaches to the subject. The two principal approaches can be summarised thus:


Andrey Borisovich Vistelius, with a firm grounding in scientific method, was a strong advocate for genetic models and hypothesis testing. Not only was this theoretically more fulfilling, but also it did not generally require the massive computer power that was not available to him in the Soviet Union.

Vistelius' beliefs as expressed in 1968, were confirmed recently in a brief historical review (Dech and Henley 2003, p 368) of his 'scientific heritage', where it was noted that he

*... supposed, and for good reason, that if a science does not use mathematical modelling in constructing its conclusions, "then it can be considered as belonging to the pre-Newtonian period, in other words such a science lags behind the present-day level of research by approximately 300 years" (Vistelius 1991). He understands that the new scientific paradigm of conceptual modelling of geological processes and objects will not be adopted by conservative geologists, the majority of whom continue to use old methods. And he writes that such a situation must be essentially changed, as to enter the twenty-first century with such a considerable time-delay is simply dangerous, not least for economic development.*

### **39.3 The International Association for Mathematical Geology**

Vistelius' participation in the IGC in Prague in 1968 was fortuitous from several standpoints. Prior to the Congress, Reyment had been the first Visiting Research Scientist at the Kansas Geological Survey (1966–67) where the idea of an International Association for Mathematical Geology (IAMG) was conceived. The first hint of mathematical geology as a subject in its own right had actually come to Reyment's attention in the late 1940s from some of Vistelius' work. Reyment then visited Vistelius in Leningrad in the early 1960s while in the USSR as a research associate at Moscow University on exchange from the University of Stockholm. From his contact with Vistelius and his experience in Kansas, Reyment had the idea of sending a questionnaire to possible interested participants in such an organisation; he received an overwhelmingly positive response, and an especially enthusiastic one from Vistelius. Later, at an ISI (International Statistical Institute) meeting in Australia, Reyment conferred with a group of international scientists, including Chester Bliss, founder of the journal *Biometrics*, and the IAMG concept was nurtured (Reyment pers. comm., 1993). On April 9th, 1968, Reyment asked for approval of a proposed set of statutes in a letter "To all Committee members": "(1) *I am in agreement with the draft statutes of Professor Whitten, amended by Prof. Vistelius and Dr. Marsal and including suggestions from Dr. Agterberg, Mr. Schlegel, and Professor van Leckwijk, …*". The founding IAMG committee adopted these statutes, and the IAMG then applied for affiliation with the International Union of Geological Sciences (IUGS) and the International Statistical Institute (ISI). The proposal for affiliation with the IUGS was supported by S. Van der Heide, Secretary General of IUGS, and accepted at the Prague meeting as a result of prodding and cajoling by Reyment, and thus the IAMG was officially born.

Vistelius had served on an ad hoc exploratory committee and then was a member of the Organizing Committee and attended, along with 19 other members, the first meeting of the committee in Prague. Eight of the attendees were from the Eastern Bloc; their attendance in Prague was allowed as being relatively 'safe.' It was the understanding of the other attendees that the 'Warsaw Pact' attendees were there on military visas (for reasons which were obvious later). The events during the Congress substantiated that understanding. Vistelius' participation in the IGC gave him visibility to Western scientists and those contacts (with Frits Agterberg, John Harbaugh, Tim Whitten, and Dan Merriam) were invaluable to him later.

Reyment had prepared a slate of officers to be ratified by the representatives, and it was no surprise that he nominated Vistelius for president. Reyment was aware of and impressed by Vistelius' work (through his Russian publications and personal contact). With Reyment's backing, Vistelius was an obvious choice for the position, and because Bill Krumbein, another possible choice for the office, was not interested, Vistelius was in; Krumbein, however, was elected the first past president! Reyment was elected Secretary General.

There was considerable discussion about the designation and focus for the new organisation. Proposed for the name of the Association's newly created journal were such adjectives as geometrics, geomathematics, mathematical geology, numerical, quantitative, etc. Vistelius championed 'mathematical geology' and, for a variety of reasons, that name was agreed on. The new Journal of Mathematical Geology was contracted to be published by Plenum Press. In 1969 in the first issue of the fledgling journal, Vistelius, as President of IAMG, wrote a Preface on the 'mathematization of geology' and contributed a short note.

At the inaugural meeting of IAMG, Andrey Vistelius championed the concept that Mathematical Geology is a separate branch of science (like Mathematical Physics) based on testing geological hypotheses mathematically, and that this science should be accepted as the primary focus of IAMG. He suggested it is not particularly important or interesting merely to manipulate geological data statistically. These had been his contentions for many years, though few of those present in 1968 appreciated the fact—and their primary objective was solely to initiate IAMG. It was not until several years later that their full significance and the historical importance of his earlier publications became clear to those outside the Soviet Union. Although it can be argued that Vistelius was largely correct, process modelling combined with objective hypothesis testing has received little attention among IAMG members over the ensuing years (Whitten 2003).

Because of the restrictions on travel and communication placed on Vistelius, most of the IAMG work load fell on Reyment as Secretary General and Merriam as editor of the new journal. Vistelius' direct contribution to the IAMG was minimal through no fault of his own, and later he served a 4-year stint on the Council helping prepare the IAMG sessions at the IGC in Moscow. Reyment succeeded Vistelius as president and by that time in 1972 the organisation was firmly established.

Vistelius attended few 'official' IAMG meetings. Because of his circumstances, it was difficult for him to make much direct contribution, except in name, to the activities of IAMG. Vistelius' unique and important scientific contributions, however, were recognized by the IAMG by awarding him the Krumbein Medal (the IAMG's highest honour) in 1980 (unfortunately he was unable to attend the IGC in Paris and collect his medal personally) and naming one of their awards in his honour. After IAMG created the Krumbein Medal in 1976, Merriam proposed another annual award for an outstanding young scientist, to be named in honour of Vistelius. The proposal was rejected by the Russian authorities on the grounds that such an honour could not be conferred upon a living person. Thus, the award was designated the President's Award in 1980 and subsequently changed to the Vistelius Award, as originally intended, after his death in 1995.

### **39.4 The "Father of Mathematical Geology"?**

Andrey Vistelius has often been referred to as the "father of mathematical geology". He was indeed the first president of IAMG, but there are many other pioneers in the field who could also be acknowledged by the title of "father" (including among others Krumbein, Griffiths, Matheron, Chayes, Krige, and Schwarzacher). Merriam (2001) names W. C. Krumbein as the "father of *computer* geology", but of course this is not quite the same thing. Vistelius, himself, as noted above, was ambivalent towards the use of computers.

The history of development of mathematical geology [in the broad sense] is essentially two stories (East and West) with little connection or interaction until near the end of the 20th Century. The two schools developed independently and partly in parallel in response to changes in the science. The quantification of geology began in earnest from modest beginnings of a few quantitatively oriented researchers, such as Vistelius, Krumbein, and Griffiths among others.

Vistelius' death in 1995 (Krumbein had died in 1979 and Griffiths in 1992), ended an extraordinary era in the growth of quantitative (mathematical) geology. Along with the rapid development of quantitative techniques and their adaptation to computers, these advances spread throughout the science and allowed rapid strides and changes to be made in the earth sciences.

Never before, and probably never again, will such rapid progress be made in such a short time, fostered by such a small group of dedicated, forward-thinking geo-giants.

### **39.5 Legacy**

It is traditional to discuss the legacy of outgoing political leaders, to assess their place in history and to estimate the quality and quantity of their achievements in the light of effects on subsequent developments. Similar discussions take place over the legacy of our foremost scientists, among whose number Andrey Vistelius must surely be counted.

His rigorous scientific training led him to develop his ideas of applying mathematical methods in modelling geological processes, to allow statistical testing of hypotheses against real data. This contrasted starkly with the approach of many western geoscientists, of using data processing capabilities of computers to fit the data using standardised methods. The latter approach allowed the identification of patterns in data, but rarely provided scientific insight into the underlying geological processes. In the English-language literature, perhaps the outstanding example of Vistelius' approach is the book *Computer Simulation in Geology* by Harbaugh and Bonham-Carter (1970) which identifies a wide range of geological process models which can be defined mathematically and implemented in computer code.

The process modelling approach pioneered by Vistelius is now making serious contributions to the geosciences. For example, in the work of Alison Ord, Bruce Hobbs, and colleagues in Australia and elsewhere, mathematical models from a number of hitherto separate fields have been combined into complex models with their recognition that the interactions of rock deformation, fluid flow, thermal transport, and chemical reaction are integral to geology. Prediction requires quantification of the processes and their interactions. What is observed is demonstrably multifractal so that we must explore and apply all that nonlinear dynamics has to offer (Ord and Henley 1997; Ord et al. 2002, 2007, 2012, 2016; Hobbs et al. 2010; Hobbs and Ord 2015, 2016).

The other approach is best typified by the field that is generally known as "geostatistics". Originating in the work of Matheron and many others, this uses purely mathematical concepts to fit models to the data. These models bear little or no relation to underlying geological processes, and the results are purely descriptive. In attempts to improve the quality of fit to the observed data sets, over the past 40 years progressively more complex mathematics has been developed, using assumptions about the statistical properties of data sets which have steadily less justification in the underlying geological processes. The history of development of geostatistics is reminiscent of the iterative refinement of the Ptolemaic astronomical model when circular planetary orbits were found to be incompatible with observations, and epicycles were added in an attempt to improve the fit. The problem, of course, was that the model was itself a mathematical fiction bearing no relation to the laws underlying planetary motions. Similarly, geostatistics is purely descriptive and bears no relationship to actual geological processes.

While geostatistics itself continues to be widely used, the more scientific approach espoused by Vistelius remains very much alive. Even though many of its practitioners are unaware of the debt of gratitude they owe to this pioneer, their work nonetheless is tribute enough.

A special issue of the Journal of Mathematical Geology (volume 35, number 4) dedicated to the memory of Vistelius was published in 2003 and contains papers by many of his former colleagues, as well as one previously unpublished paper by Vistelius himself (Dech et al. 2003; Vistelius and Pavlov 2003; Azimov and Shtukenberg 2003; Harlamov 2003; Voytekhovsky and Fishman 2003; Podkovyrov et al. 2003; Kotov 2003). The breadth of geoscientific subject matter and mathematical approaches shown by this collection of papers is ample illustration of the scientific legacy of Andrey Borisovich Vistelius.

### **References**

Azimov P, Shtukenberg A (2003) Numerical modeling of growth zoning at nonstationary crystallization of solid solutions: metamorphic garnets. Math Geol 35:405–430

Dacey MF (1979) Models of bed formation. Math Geol 11(6):655–668


### **Publications of A. B. Vistelius**


means of the series $\sum_{i=0}^{k} e^{a_i x + b_i} \cos(\omega_i x + y_i)$. Doklady Akad Nauk USSR 49(7):531–535



### **Chapter 40 Fifty Years' Experience with Hidden Errors in Applying Classical Mathematical Geology**

**Hannes Thiergärtner**

**Abstract** Classical mathematical geology is a branch of mathematical geosciences in which mathematical methods and models—not specifically developed for and not exclusive to specific geosciences—are applied to describe, to model and to analyse quantitatively geoscientific subjects and processes. It was the dominant approach in the 1960s to 1980s and it is still used today to solve numerous, mostly limited and less complex problems. The methods have been implemented in the form of algorithms in commercial software packages that are widely used in geological practice. Their application frequently assumes specific pre-conditions, which are often difficult, if not impossible, to verify. This situation can result in significantly spurious output and errors that are often not recognised (hidden errors). In this paper five case studies are used to demonstrate these errors. In particular, they demonstrate that small mistakes can lead to serious, but often unrecognised, misinterpretations. The main conclusion is that there is a need to improve education and training in classical mathematical geology especially for engineering sections of consulting firms, governmental agencies and individual consultants.

**Keywords** Mathematical geology ⋅ Application ⋅ Case studies ⋅ Error ⋅ History of the IAMG

### **40.1 Introduction and Definitions**

The application of mathematical formulae and methods to solve geological problems started decades before the International Association for Mathematical Geosciences (IAMG) was founded. Initially, simple methods were used to compute derived parameters such as petrochemical mineral norms or grain size distributions and grain shapes. W. C. Krumbein in Chicago and A. B. Vistelius in the former Leningrad (now again St. Petersburg) were the first to introduce probability-based statistical methods into geoscientific applications. In the 1960s sophisticated mathematical methods were increasingly developed and applied simultaneously with the development of electronic data processing. Numerous monographs were published to introduce these new tools to geologists (Table 40.1). Step by step, a new sub-discipline, termed "mathematical geology", was established. It was within this context that the IAMG was established as an association within the International Union of Geological Sciences at the International Geological Congress in Prague 1968.

H. Thiergärtner (✉)

Department of Geosciences, Free University of Berlin, c/o Kohlisstraße 65, 12623 Berlin, Germany e-mail: thiergartner@aol.com

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_40

The majority of the methods introduced into the geosciences between the 1960s and 1980s were based on probability-statistical or heuristic models. Due to their high level of abstraction, these methods are equally applicable to the solution of analogous problems in other natural or social sciences provided the required data are available. Table 40.2 summarises some essential methods belonging to this group. For the purposes of this paper, these methods and models are classified as classical mathematical geology. (The term "geology" has recently been replaced by "geosciences" but the latter includes the former).

Classical mathematical geology applies mathematical methods and models, which comprise procedures that are not developed specifically for geosciences and which do not bear any direct relation to geological subjects or geological processes. They are extensively implemented in software packages such as Statistical Package for the Social Sciences (SPSS), and have been described in detail in the literature (e.g., Bühl 2016).

Over recent decades the development of mathematical geosciences has resulted in many new advanced models. These models have mostly been developed for specific geoscience applications such as basin modelling, groundwater flow models, contaminant transport models, heat flow models, and so they differ from the classical mathematical geology. This contribution does not cover these specific methods and models.

Classical mathematical geology models retain their applicability and practical advantages. They are helpful tools when other (specific) approaches are not available, when the effort of developing a new model would be disproportionate, when the geological problem does not require specific solutions or when limited questions are to be answered on the basis of few data. To date, this area of mathematical geology has not been replaced by later developments and it remains a useful component of the complete set of methodologies.

### **40.2 Hidden Errors and Case Study Examples**

In the course of the past 50 years many correct, useful results have been generated by the application of classical mathematical geology. Whilst the application of classical mathematical geology does not necessarily result in incorrect or inaccurate solutions of geoscientific problems, it does have the potential to do so. Incorrect and frequently undetected errors can occur if the user is not sufficiently experienced with mathematical methods, with data processing problems or with the application of computer software. These problems occur mostly in geoscientific practice, especially when time and/or finance are restricted and the work is subject to pressure to produce positive results.

**Table 40.1** Early monographs of mathematical geology

**Table 40.2** Selected models of classical mathematical geology

Spurious results or undetected errors result from apparently negligible inaccuracies such as unavailable or insufficient knowledge of data accuracy and precision, the uncritical use of values below the threshold of measurement, the merging of different input data types, statistical parameter estimations without preliminary tests of the underlying type of frequency distribution, the restriction of correlation analyses to linear models, an unsuitable selection of non-supervised classification models and strategies, the inclusion of non-informative attribute sets in the data file, missing information about the significance of statistical results, the acceptance of meaningless correlations, and the uncritical spatial or temporal extrapolation of trend-analytical results.

The application of classical mathematical geology methods and models requires frequent consideration of specific (mathematical) conditions such as the existence of a certain probability distribution, the independence of variables, a minimum number of observations, the proper treatment of missing values, a suitable choice of cluster model and strategy. Usually, long-term experience of the correct interpretation of results is necessary to avoid errors. All these fundamental conditions appear to be rarely included in training programmes and apparently insufficiently taught in courses. Commercial software is easy to handle but no signal alerts the user to the absence of essential pre-conditions and consequent occurrence of an inherent error in the results. Must computer-generated results be accepted as unbiased and reliable simply because they are produced by electronic equipment? Five selected cases derived from earlier projects will be used to demonstrate the problem in detail.

### *40.2.1 Bathymetric Map of the Azores*

The archipelago of the Azores (Ilhas dos Açores) consists of nine islands and a reef area in the North Atlantic Ocean and is the result of partially active volcanoes. It covers an ocean surface between 31°30′ and 24°30′ W and 36°30′ and 40°00′ N (Fig. 40.1).

The Azores are situated on the Azores plateau, an area of thickened oceanic crust due to submarine volcanism caused by a hot spot at the Azores triple junction. The NE-SW striking Mid-Atlantic Ridge crosses the plateau between the Graciosa Island and Terceira and continues over São Jorge and Pico. Along this tectonic element, the North American plate and the Eurasian Plate drift to the west and the east respectively. The Corvo and the Flores islands belong to the American plate. The NW-SE striking Terceira rift runs from the island Graciosa over the São Miguel island to the southeast. This is the tectonic line along which the African plate is subducting under the Eurasian plate. The volcanic and seismic activity started in the Miocene epoch and the formation of the islands continued during the Neogene period.

This entire part of the Atlantic Ocean is of great geological and economic interest and is the target of numerous geoscientific expeditions. The sea floor consists of basaltic rock and young volcanic glasses covered by abyssal clay and biogenous and clastic sediments (cf. Hübscher 2015). The close proximity to the crustal magmatic events causes the formation of important raw materials such as manganese nodules.

**Fig. 40.1** The Azores. Area of investigation

A fundamental component of marine survey expeditions is to make depth soundings of the locality. The depths measured in the early 1980s were interpolated by specialists at a computer centre to construct bathymetric contour lines. They used kriging interpolation, the results of which are shown in Fig. 40.2. These results do not reflect the expected predominant NW-SE striking structures described above and give a distorted representation of the main morphological structures. Investigation showed that a suitable mathematical model had been applied to faulty input data: the geodetic coordinates were given in degrees and minutes, but the minutes had been recorded as decimal places of a degree. This error was not detected in the computer centre. The result obtained after correcting the data is shown in Fig. 40.3, in which the map more closely reflects the main morphological structures of the investigated area (OpenSeaMap 2016).

**Fig. 40.2** The Azores. Bathymetric contour lines based on inaccurate input data

**Fig. 40.3** The Azores. Bathymetric contour lines based on corrected input data
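The effect of this kind of coordinate error can be illustrated with a minimal sketch. It assumes, hypothetically, that a position such as 36°30′ was keyed in as the number 36.30 and then read as decimal degrees; the function names and the sample positions are illustrative and are not taken from the original survey.

```python
def misread_as_decimal(degrees: int, minutes: float) -> float:
    """Reproduce the error: minutes written after the decimal point (36°30' -> 36.30)."""
    return degrees + minutes / 100.0


def to_decimal_degrees(degrees: int, minutes: float) -> float:
    """Correct conversion: one degree has 60 minutes (36°30' -> 36.5)."""
    if not 0 <= minutes < 60:
        raise ValueError("minutes must lie in [0, 60)")
    return degrees + minutes / 60.0


# Hypothetical latitudes inside the Azores map window (degrees, minutes)
for deg, mins in [(36, 30), (38, 45), (39, 59)]:
    wrong = misread_as_decimal(deg, mins)
    right = to_decimal_degrees(deg, mins)
    print(f"{deg}°{mins:02d}': misread {wrong:.4f}, correct {right:.4f}, "
          f"shift {right - wrong:+.4f}°")
```

The displacement is not constant: it is zero at full degrees and grows to roughly 0.4° just below the next full degree. The soundings are therefore shifted unevenly, which is one plausible way in which an otherwise suitable interpolation can produce distorted contour lines.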

### *40.2.2 Granulometric Analysis of Coastal Sediments of the Southern Baltic Sea*

The Bay of Greifswald (Greifswalder Bodden) in Germany occupies the south-central part of the Baltic Sea. Holocene sand, gravel and boulder cover late Pleistocene till and basin sand. The recent material originated from an active cliff and from an abrasion platform (for details, see Niedermeyer et al. 2011). The fine, medium- and coarse-grained sediments show a lithological differentiation more or less parallel to the erosional shore line. The grain size is specified using the European standard DIN EN ISO 14688-1 (2013).

Knowledge of the characteristics of the sediment is important for designing measures to protect the coast and is necessary if the raw material is to be exploited for building purposes (cf. Börner 2011). One of the relevant parameters is the grain size. A principal component analysis was conducted to reduce the dimensions, or the number of manifest attributes, to a smaller number of latent components which largely explain the variance of the input data, and to avoid an undesirable multi-collinearity, i.e. to obtain a set of essential information (Fig. 40.4). A cluster analysis of the attributes (R-mode) was intended to explain the relationships between the original grain size classes (Fig. 40.5). The result reflects only a trivial fact: the coastal sediments are mainly composed of silt and fine sand if they are not coarse-grained, and vice versa.

**Fig. 40.4** Baltic Sea. Clastic sediments of the south coast. Principal component analysis of grain size data

The input information scaled in mass% was correctly recorded. However, the fact that the sum of all sieve fractions amounts to the constant sum of nearly 100% was ignored. This closed system means that mathematical results based on correlation among the attributes must be faulty. Chayes (1960a, b, 1971) and Vistelius and Sarmanov (1961) showed that the so-called percent correlation leads to unfeasible results. The modern approach to processing data that form a closed system was developed only later and therefore could not be applied (e.g., Pawlowsky-Glahn 2005; Pawlowsky-Glahn and Buccianti 2011).
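The closure effect can be demonstrated with a short sketch on synthetic data (not the Greifswald measurements). Three independent positive quantities are generated, converted to mass% so that each sample sums to 100, and the correlation between the first two parts is compared before and after closure. The centred log-ratio (clr) transform shown at the end is one of the standard transformations of compositional data analysis; its use here is an illustration, not a reconstruction of any analysis in the case study.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def clr(parts):
    """Centred log-ratio transform of one composition (all parts > 0)."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


random.seed(42)  # fixed seed so the sketch is repeatable

# Three independent, strictly positive "raw masses" per sample (synthetic data)
raw = [[random.uniform(1.0, 10.0) for _ in range(3)] for _ in range(2000)]

# Closure: express each sample in mass% so that every row sums to 100
closed = [[100.0 * v / sum(row) for v in row] for row in raw]

r_raw = pearson([s[0] for s in raw], [s[1] for s in raw])
r_closed = pearson([s[0] for s in closed], [s[1] for s in closed])
print(f"raw masses: r = {r_raw:+.3f}")     # near zero by construction
print(f"mass%:      r = {r_closed:+.3f}")  # clearly negative, induced by closure

# Log-ratio coordinates; each transformed row sums to zero
clr_rows = [clr(row) for row in closed]
```

Although the raw quantities are independent, the percentages are negatively correlated simply because they must add up to a constant. Any principal component or cluster analysis built on such correlations inherits this artefact.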

### *40.2.3 Areal Distribution of Polycyclic Aromatics in an Abandoned Industrial Site*

Until its abandonment an extensive industrial site in Germany was used for machine manufacture. During later assessment for redevelopment the site was investigated for possible ecological contamination. The disused, unsealed enterprise is located on near-surface Holocene sand and gravel. The consultants sampled and analysed twenty-five soil specimens and detected an appreciable concentration of polycyclic aromatic hydrocarbons (PAHs) at two locations. PAHs belong to a group of extremely carcinogenic substances. The 16 most important and persistent constituents are on the National Priority Pollutant List of the US-EPA. An occurrence of these hydrocarbons in subsoil typically requires appropriate remediation measures.

A map of the distribution of the pollutant within the site was constructed by means of kriging (Fig. 40.6) and an expensive soil excavation at well no. T20 over 75 m<sup>2</sup> and at well no. T05 over 46 m<sup>2</sup> was proposed. The application of this geostatistical model to create a contour map is a widely used technique in mathematical geosciences. Similar cases have been repeatedly observed. Less well-known is that the analysed attributes must be interpolable between adjacent points.

**Fig. 40.6** Contaminated industrial site. Contour map of the apparent distribution of polycyclic aromatics in subsoil

PAHs include a wide spectrum of organic substances with relatively low solubility in water, e.g. naphthalene (32 mg l<sup>−1</sup>), acenaphthylene (3.4 mg l<sup>−1</sup>), acenaphthene (4 mg l<sup>−1</sup>), fluorene (1.8 mg l<sup>−1</sup>) and pyrene (0.134 mg l<sup>−1</sup>). Their solubility in water and tendency to migrate into aquifers rises if solvents such as mineral oil, halogenated organic compounds or phenols are present. PAHs are generated during coking processes or coal-gas generation but they never occur as waste products of machine manufacturing. A study by a project controller showed that a gas generation facility had been operational on the site until 1898. Coal tar was an unprofitable by-product at that time and it was frequently deposited near the factory. Thus, PAH-bearing waste was also deposited locally, at distinct locations. Fluids originally included in the waste are removed by natural weathering processes over decades. At present, the remaining solid PAH components are persistent and relatively immobile (Stupp and Püttmann 2001). The result of this man-made impact is a spatially limited, although not tolerable, area of contamination. Any extension of these spatially limited occurrences caused by mathematical interpolation methods is meaningless.

The groundwater flow direction must be included in the risk evaluation if contaminants in unconsolidated subsoil are water-soluble and if they are able to migrate. Contour maps generated by standard kriging cannot consider this factor and its application would also result in an incorrect result.

The insolubility of the pollutants under natural conditions prevents them from migrating. Because of this property of the contamination, it is not correct to interpolate the detected PAH concentration values between observed locations. An isoline map predicts an area-wide contamination whereas only local and isolated pollution actually occurs. It was later recommended that the survey data be presented in the form of a point map (Fig. 40.7) and that future remediation focus on the observed hot spots.

A similar case study was discussed by Thiergärtner (1995).

### *40.2.4 Ore Grade Estimation in a Cassiterite Mine*

Tin ore has been mined for centuries in Altenberg (Saxony, Germany). Monzo-, aplite- and albite-granite intruded during the Cisuralian epoch (Permian) into Precambrian paragneiss and were followed by acid, fluorine and silica rich overcritical auras. Feldspar was mainly altered to quartz; lithium bearing mica, topaz, fluorite, and ore minerals such as cassiterite, wolframite, and molybdenite crystallised in the form of small grains. For details, see Weinhold (2002).

Thirty samples were taken from an exploration gallery to calculate the mean grade of the deposit yield in the investigated direction. The range of the metal content was 0.07–4.22% tin. The arithmetic mean of all analysed sample values was computed "as usual" to be *m*<sub>arith</sub> = 0.755% Sn. Inspection of the empirical histogram and the fitted normal distribution curve (Fig. 40.8) showed that the metal grade was extremely skewed to the right. A lognormal distribution was fitted to the input data (Fig. 40.9) and the arithmetic mean of the (decimal) logarithms of tin grade was calculated and the corresponding antilogarithm (*m*<sub>geom</sub> = 0.512% Sn) was obtained. This value is less than the arithmetic mean. The geometric mean is a location parameter, like the median or mode, and scarcely suitable to estimate the expected value of a population.

**Fig. 40.7** Contaminated industrial site. Hot spot map of polycyclic aromatics in subsoil

**Fig. 40.8** Tin grade. Arithmetical mean value estimation

**Fig. 40.9** Tin grade. Lognormal mean value estimation

Which estimator should be applied? The statistically "best" estimator Ê(*m*<sub>lg</sub>) of the expected value for lognormally distributed data is calculated using Eq. 40.1, developed by Aitchison and Brown (1957) and applied, e.g., by Dowd (1984):

$$\hat{E}\left(m_{\lg}\right) = 10^{\mathrm{mean}(\lg X)}\, e^{k} \left(1 - \frac{k(k-1)}{n} + \cdots\right) \tag{40.1}$$

where *k* = 2.65095 var(lg *X*) and *n* is the number of observations. The constant 2.65095 equals (ln 10)<sup>2</sup>/2; it converts the variance of the decimal logarithms into half the variance of the natural logarithms.

This estimator is rather poorly known in geoscientific practice. The estimation results in *m*<sub>lg</sub> = 0.765% Sn for the given example. Only this value can be applied to estimate the mean tin grade of the investigated gallery in an unbiased way. The true ore grades of samples in an operating underground mine can be used to estimate the mean grade of un-mined volumes of ground and this is one of the most important parameters in determining economic feasibility.
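The three estimates discussed above can be compared in a few lines. The following sketch uses hypothetical grades rather than the thirty Altenberg samples, takes the sample variance (divisor *n* − 1) as an assumption, and evaluates only the leading term of Eq. 40.1, i.e. the geometric mean multiplied by e<sup>k</sup>; the bracketed correction in *n* is deliberately left out.

```python
import math
import statistics

# Hypothetical tin grades in % Sn (not the 30 Altenberg samples)
grades = [0.07, 0.12, 0.18, 0.25, 0.31, 0.40, 0.55, 0.80, 1.30, 4.22]

lg = [math.log10(g) for g in grades]

m_arith = statistics.fmean(grades)   # arithmetic mean of the grades
m_geom = 10 ** statistics.fmean(lg)  # antilog of the mean decimal logarithm

# k = 2.65095 * var(lg X); the constant is (ln 10)^2 / 2
k = (math.log(10) ** 2 / 2) * statistics.variance(lg)  # sample variance, assumed

# Leading term of Eq. 40.1 only; the bracketed correction in n is omitted
m_lg = m_geom * math.exp(k)

print(f"arithmetic mean       {m_arith:.3f} % Sn")
print(f"geometric mean        {m_geom:.3f} % Sn")
print(f"lognormal estimate    {m_lg:.3f} % Sn")
```

Because *k* cannot be negative, the lognormal estimate always lies at or above the geometric mean, which is consistent with the remark that the geometric mean underestimates the expected value of a right-skewed population.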

### *40.2.5 Classification of a Doleritic Sill Using Trace Elements*

Tholeiitic basalt occurs in the Thuringian Forest (Germany) as Sakmarian doleritic sill (Lower Permian). It is intruded into a sandstone–siltstone formation. The contacts are metamorphosed. This sill was extensively described recently by Andreas and Voland (2010).

The matrix of the dolerite consists of pyroxene, plagioclase, olivine, alkali feldspar and some magnetite. The drill core was partitioned into seven sections by petrographical analysis (Table 40.3; Fig. 40.10a). Megascopic and mineralogical indications differ negligibly.


**Table 40.3** Vertical sections of doleritic sill, Thuringian Forest

**Fig. 40.10** Doleritic sill. Geochemical classification (for explanation see Table 40.3)

The question of interest was how this sequence correlates with the chemical composition of the rock. Four trace elements were selected: cobalt and nickel, which are fixed in olivine where they replace magnesium; zirconium, as a constituent of feldspar or pyroxene; and copper, which can occur in the form of microscopically small chalcopyrite crystals. These four chemical elements were analysed in 79 samples covering the whole sequence.

A hierarchical cluster analysis was carried out. It was based on z-scaled (normalized) input data to avoid an overestimation of attributes with large values, using the squared Euclidean distance measure and the Ward method. The cluster dendrogram shows clearly distinguishable classes and can be interpreted without difficulty. All resulting and interpretable classes have been assigned to the cross-section (Fig. 40.10b). Chemical symbols without brackets refer to high concentrations of an element in Fig. 40.10. Where medium concentrations occur they are enclosed in brackets and missing chemical symbols indicate low contents in the sample. The expressions High, Medium, and Low refer to the overall mean values. The figure shows first that the geochemical composition differs noticeably from the petrographical structure. The number of clearly distinguishable geochemical classes is low. Thick parts of the sill seem to be characterised by a similar micro-chemical composition. It is obvious that samples with lower depth (hanging-wall samples, marked by *h*) dominate the hanging-wall of the sill, and samples taken from the footwall (marked by *ly*) mainly occur at deeper levels. The middle section comprises samples that were collected at depths between 200 and 300 m (marked by *m*).

The results gave sufficient reason to review the methodological approach. First, it was noted that the depth was included as one of the input parameters and the parameter "depth" significantly influences the classification. Such procedures are not faulty in a mathematical sense but they accentuate the effect of neighbouring samples within a common class due to the similar value of the parameter "depth". Within the drill core neighbouring samples have a higher chance of falling into this common class than do the more distant samples. This effect should be avoided if not explicitly requested by the researcher. The relatively long sections of the profile with little or no geochemical variation can be explained by this effect. Secondly, the critical test showed that the inclusion of both cobalt and nickel in the analysis caused an overestimation of the olivine component. The linear pairwise correlation coefficient (Pearson) between Co and Ni was calculated as *r* = +0.915. Cobalt is not significantly correlated with copper or zirconium, and copper and zirconium are uncorrelated, too. This result could be expected from the relationships of the geochemical bonds.

A repeated cluster analysis based on the attributes Ni, Cu and Zr resulted in classes which were drawn into the rock sequence as shown in Fig. 40.10c. The influence of the depth is eliminated and the double effect of the trace elements Co and Ni—reflecting the olivine content—is reduced to only one factor. Much more detail is visible; i.e. a clear vertical geochemical differentiation can be recognised. In addition, the resulting geochemical classification of the rock profile does not simply correspond to the mineralogical and petrographical structure and displays more essential details than the first result. Although the mathematical model was chosen correctly, an incorrect set of input data was applied to solve the problem.
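The two checks that led to the revised attribute set can be sketched as a simple screening step carried out before any cluster analysis. The data below are hypothetical, the 0.9 cut-off is an assumption chosen for illustration, and the helper names are not taken from any particular software package.

```python
import math


def zscores(values):
    """Standardise to zero mean and unit (sample) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pearson(x, y):
    """Pearson correlation coefficient computed from z-scores."""
    zx, zy = zscores(x), zscores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Hypothetical trace-element contents (ppm) and sample depths (m)
data = {
    "depth": [150, 180, 210, 240, 270, 300, 330, 360],
    "Co": [42, 55, 38, 61, 47, 35, 58, 50],
    "Ni": [120, 160, 105, 175, 138, 98, 170, 142],
    "Cu": [60, 35, 80, 52, 41, 75, 58, 30],
    "Zr": [110, 95, 130, 100, 125, 90, 115, 105],
}

# Depth is a sample coordinate, not a geochemical attribute: leave it out
attributes = [name for name in data if name != "depth"]

# Flag attribute pairs that carry nearly the same information
THRESHOLD = 0.9  # assumed cut-off for this sketch
redundant = []
for i, a in enumerate(attributes):
    for b in attributes[i + 1:]:
        r = pearson(data[a], data[b])
        if abs(r) > THRESHOLD:
            redundant.append((a, b, round(r, 3)))

print(redundant)
```

A flagged pair suggests that one of the two attributes should be dropped, or the pair replaced by a single combined attribute, before distances are computed. Otherwise the shared factor, here the olivine content, is counted twice in the classification.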

### **40.3 Conclusion and Suggestions**

Classical mathematical geology is the application of non-specific mathematical methods and models to solve geoscientific problems. Its proper application frequently results in useful solutions but its misapplication can generate spurious results that may not be recognised. These hidden errors are not caused by the algorithms but by insufficient knowledge of their application and deficient experience with their use. They are avoidable.

A significant proportion of the methodological contributions to classical mathematical geology are written in an academic environment. New developments are mainly published in journals specialising in mathematical geosciences. However, only in rare cases are they evaluated by engineers and geoscientists working in engineering practices, mining companies, environmental bureaus or governmental agencies, or as individual consultants.

As a conclusion the following suggestions are offered to developers of mathematical-geoscientific methods, models, algorithms and software and to all academic teachers in the field of mathematical geosciences:


**Acknowledgements** I am grateful to F. P. Agterberg, Ottawa, and B. S. D. Sagar, Bangalore, for the invitation and encouragement to contribute on the occasion of the Golden Anniversary of the International Association for Mathematical Geosciences. Many thanks go to P. Dowd, Leeds (UK)/Adelaide (AU) and D. Little, Swavesey (UK), for an inspiring discussion of the draft and significant comments. I also would like to acknowledge the reviewers for their interesting suggestions to improve the article.

### **References**

Aitchison J, Brown JAC (1957) The lognormal distribution. University Press, Cambridge

Andreas D, Voland B (2010) Der Dolerit der Höhenberge – Teil eines eigenständigen Höhenberg-Intrusionsintervalls – sein Gesamtprofil in der Bohrung Schnellbach 1/62 und die Einordnung der Intrusion in den Ablauf der Rotliegendentwicklung des Thüringer Waldes [The dolerite of Höhenberge: discrete part of the Höhenberg intrusion interval; profile of the drilling Schnellbach 1/62 and stratigraphic position of the intrusion in the Rotliegend]. Beitr Geol Thüringen NF 10:23–82


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.


### **Chapter 41 Mathematical Geology by Example: Teaching and Learning Perspectives**

**James R. Carr**

**Abstract** Numerical examples and visualizations are presented herein as teaching aids for multivariate data analysis, spatial estimation using kriging and inverse distance methods, and the variogram as a standalone data analytical tool. Attention is focused on the practical application of these methods.

### **41.1 Introduction**

An oxymoron. Mathematical geology has been characterized as such. Saying so, though, betrays ignorance, not of mathematics, but of geology. The science is inherently numerical. Minerals, for example, are quantifiable based on specific gravity, hardness, Miller index, and abundance. Rock classification in petrology and petrography is inherently dependent upon mineral frequency, determined in a manner identical to that which is used by the hematologist when classifying specimens of blood. Geologic structures are quantified by strike and dip, even abundance when characterizing the integrity of rock masses. Economic geologists and geochemists develop complex databases of samples, each associated with many elements, the analysis of which provides clues to ore genesis, water origin, environmental stresses, and rock classification, to name but a few applications. Geophysics and remote sensing provide enormous sets of numbers visualized as digital images. Far from being an oxymoron, mathematical geology is broadly defined as the application of theoretical and applied mathematics to the assessment of geologic data to aid in the interpretation of earth evolution.

The word, aid, is not chosen carelessly. No equation, no calculator, no computer, can substitute for the human ability to infer and interpret. Where equations, calculators, and computers can help with geologic interpretation is in the conversion of numbers to pictures, such as the case when converting numbers comprising a digital image into what mimics a photograph on a computer screen. Scientific visualization is the process of converting numerical information of any kind into a picture, hopefully improving its interpretation. The responsibility of interpretation remains, always, with the human analyst.

J. R. Carr (✉)

Department of Geological Sciences and Engineering, University of Nevada, Reno, Reno, NV 89557, USA e-mail: carr@unr.edu

<sup>©</sup> The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_41

There tends to be an element of distrust of numbers. The quote attributed to Benjamin Disraeli is well known: "There are three kinds of lies: lies, damned lies, and statistics." It is uncertain whether Disraeli actually said this. The saying was, however, widely used by the end of the 19th century. Mark Twain, for example, wrote in 1906: "Figures often beguile me, particularly when I have the arranging of them myself; in which case the remark attributed to Disraeli would often apply with justice and force: 'There are three kinds of lies: lies, damned lies, and statistics.'" (*Autobiography of Mark Twain*).

Perhaps mistrust of numbers is not as accurate as saying that there exists a reverence of numbers due to a fundamental insecurity about mathematical understanding. The presentation of a statistical analysis can be quite intimidating to those whose confidence in understanding the analytical methods is weak. Of course, the weak confidence can be taken advantage of by those less scrupulous, stating interpretations of numbers for which there is no clear justification. Thus the skepticism surrounding statistics—lies worse than damned lies.

Despite this ignorance, statistical analysis of data is the most widely applied mathematical method in the geological sciences. Geologists draw maps, with geostatistics, geographic information systems (GIS), and remote sensing fundamentally contributing to the process. Mine *geologists* are increasingly charged with ore reserve estimation and ore control using geostatistics. Other examples of applied statistics include bivariate and multivariate methods important for understanding the correlation between two or more variables. Other numerical methods of importance to geologic understanding are finite difference modeling for understanding ground water flow, geostatistical simulation for modeling uncertainty of spatial data, time-series (Fourier) analysis for identifying cycles in data strings over time or space, linear algebra for modeling landform and geologic structure morphology, fractal geometry for understanding scaling in geologic processes, and the application of neural networks to the modeling of geologic processes.

Some of these applications have proven less interesting to students of mathematical geology than others. Three and a half decades of teaching applied mathematics to earth scientists and engineers at a hardrock mining school provide the backdrop for the following observations. One, *graduate* students of economic geology, and economic geology professionals as well, eagerly seek instruction and advice in multivariate methods applied to rock geochemistry data, with an emphasis on interpretation for a better geologic understanding of ore deposits. These students and professionals typically want a pure course on multivariate data analysis. Two, teaching kriging theory in an *undergraduate* course is a waste of time when the heavily parametric practice of spatial estimation is considered. Industry often views universities as workshops for training mine geologists and engineers on the use of a particular choice of mine planning software, such as SURPAC, and teaching how to use the software and what choices to make for parameter definition is more than one course can provide. These students are less interested in kriging theory than they are in how to interpret a variogram, how to design a grid, what type of estimator to use, and more. The challenge in this case is how to answer these and other questions while not overstaying one's welcome when explaining theory. Three, the variogram is popular among students and professionals as an analytical tool when kriging is not the primary goal; students and practitioners of remote sensing, in particular, use the variogram for various types of digital image processing.

Multivariate data analysis, the *practice* of kriging, and the variogram as a stand-alone data analytical tool are presented in this chapter with an emphasis on their teaching. Both teacher and student perspectives are presented to balance the discussion between tips for learning and advice for teaching.

### **41.2 Multivariate Analysis of Geochemical Data**

During the 1930s, psychologists began to apply principal components methods to help with the interpretation of their data (e.g., Hotelling 1933; Young 1937). Many psychologists collect data on patients characterizing their behavioral traits. Principal components methods allow psychologists to group patients with similar behaviors, resulting in a better understanding of them. Three decades later, sedimentologists (e.g., Imbrie and Purdy 1962; Klovan 1966) used principal components analysis to group samples of sediment based on sedimentological characteristics. In this case, the sediment sample is analogous to the patient and the sample characteristics are analogous to behavioral traits. How sediment samples group can be an indication of sediment source, depositional environment, composition, or some other condition of importance to geologic interpretation. A collection of papers published in 1983 (Howarth 1983) reviewed the application of multivariate analysis to geochemical prospecting. Tomes written on geochemistry (e.g. Albarède 1995) often discuss multivariate analysis applied to the interpretation of geochemical data.

Many mathematical methods have been developed to help with the analysis of multivariate data. An important goal of each of these methods is a reduction in the number of variables to enable a more efficient understanding. In other words, if there are M original variables, a smaller number of variables, B, is sought that defines a lower-dimensional sub-space. Then, the original M-dimensional data are projected onto this sub-space to yield a plot (graph) that is visually inspected to appreciate data similarities and differences. For students and teachers alike, the ultimate goal of multivariate analysis is the creation of these plots, the study of which motivates *subjective* conclusions about data associations (Greenacre 1984).

In order to develop the plots, some mutually orthogonal coordinate system is needed. Many of the mathematical methods used to analyze multivariate data involve the reduction of the original data information into some matrix that is eigendecomposed to obtain eigenvalues, each associated with a unique eigenvector. The eigenvectors are mutually orthogonal. Moreover, these eigenvectors define the lower dimensional sub-space. They are the principal components of data intercorrelation information.

Can multivariate data analysis be taught without explaining, or at least reviewing, eigendecomposition of data? Answering yes leads to a teaching approach that treats the multivariate analytical algorithm as a black box. This approach relieves teachers of the chore of explaining a method that many students abhor. Undergraduate students, and even graduate students in some cases, are likely to skip class when eigenvalues and eigenvectors are to be discussed. Modern students are quick to dismiss that which they do not like, or find boring. For example, based on the experience of co-teaching mineralogy for the past five years, a discussion of crystallography often results in a rather empty classroom. A mental laziness is betrayed by students' behavior in this regard. It frustrates teachers wanting students to achieve an understanding of analytical methods deeper than the data in–data out black box.

Of course, answering no to the foregoing question and teaching multivariate data analysis outside the black box is confounded by the same student attitudes. Their learning cannot be forced. It can, however, be enticed by numerical examples that are straight-forward, explained in class, and reinforced by extracurricular calculations. Students can be shown that an understanding deeper than black box mysticism is relatively easy to achieve. What follows is a demonstration of this concept and is intended as an aid to instruction. Student understanding can be assessed by substituting the starting data table with alternative data.

## *41.2.1 Numerical Insight to Multivariate Data Analysis*


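Geochemical data from seven rock samples, each characterized by five elements (variables), are presented in the following table:

| Sample | Au | Ag | Cu | Pb | Zn |
|---|---|---|---|---|---|
| … | … | … | … | … | … |
| 4 | 37 | 0.5 | 24 | 17 | 7 |
| 5 | 33 | 0.3 | 14 | 13 | 5 |
| 6 | 12 | 0.4 | 21 | 29 | 5 |
| 7 | 12 | 0.4 | 13 | 19 | 5 |

*Note* values for Au, Ag, Cu, Pb, and Zn are in ppm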

These data represent a five-dimensional variable space. The goal is to determine the eigenvectors for these data, the sub-space that will be used for plotting. A theorem presented by Eckart and Young (1936) holds that any real valued data matrix can be represented as the following product, [data] = [R-mode eigenvectors] [eigenvalues][transposed Q-mode eigenvectors]. In this case, Q-mode multivariate analysis is focused on the relationships among samples. R-mode multivariate analysis is one that focuses on the relationships among the variables.
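The Eckart and Young product can be checked numerically with a singular value decomposition. The following is a minimal sketch, assuming NumPy and using only the four table rows that are legible here (samples 4 through 7), so its numbers will not match the seven-sample results quoted in this section. In the naming used in this chapter, the columns of `U` are eigenvectors of [data][data]ᵀ and the rows of `Vt` are eigenvectors of [data]ᵀ[data]; the variable names are illustrative only.

```python
import numpy as np

# Samples 4-7 of the table (Au, Ag, Cu, Pb, Zn in ppm).
# Restricting the example to these four rows is an assumption for illustration.
X = np.array([
    [37.0, 0.5, 24.0, 17.0, 7.0],
    [33.0, 0.3, 14.0, 13.0, 5.0],
    [12.0, 0.4, 21.0, 29.0, 5.0],
    [12.0, 0.4, 13.0, 19.0, 5.0],
])

# Thin singular value decomposition: X = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_rebuilt = U @ np.diag(s) @ Vt

# Eigenvalues of the two "squared" matrices, sorted from largest to smallest
evals_samples = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]    # 4 x 4 matrix
evals_variables = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]  # 5 x 5 matrix

print("singular values:         ", np.round(s, 3))
print("sqrt of eigenvalues (XXt):", np.round(np.sqrt(evals_samples), 3))
```

The singular values on the diagonal of the middle matrix are the square roots of the eigenvalues of either squared matrix, which is why both orders of multiplication lead to the same relative importance of the factors.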

To obtain the Q-mode result, the data matrix is multiplied by its transpose in the following order, [transposed data matrix][data matrix], to yield a square matrix as follows: [data]T [data] = [Q-mode]:


The result of this multiplication is a square matrix, M × M in size, and M is the number of original variables, 5 in this case. This square matrix is the one from which eigenvalues and Q-mode eigenvectors are obtained [software is necessary for eigendecomposition, ironically rendering this step as a black box]:

$$\begin{array}{rcl} \text{Eigenvalues:} & \begin{pmatrix} 91.8 & 0 & 0 & 0 & 0\\ 0 & 25 & 0 & 0 & 0\\ 0 & 0 & 10.1 & 0 & 0\\ 0 & 0 & 0 & 4.8 & 0\\ 0 & 0 & 0 & 0 & 0.2 \end{pmatrix} \\\\ \text{Eigenvectors:} & \begin{pmatrix} 0.655 & -0.73 & -0.10 & -0.17 & -0.007\\ 0.011 & 0.005 & -0.01 & -0.17 & 1.0000\\ 0.500 & 0.228 & 0.336 & 0.765 & -0.008\\ 0.536 & 0.618 & -0.47 & -0.33 & -0.017\\ 0.183 & 0.181 & 0.809 & -0.53 & -0.007 \end{pmatrix} \end{array}$$

The eigenvectors are loaded column-wise. The eigenvalues are loaded into a matrix along the diagonal. All off-diagonal entries in the eigenvalue matrix are zero. These eigenvalues are actually the square roots of those computed directly from the R-mode matrix because the original data matrix is squared when multiplied by its transpose. Because this multiplication yields a square, symmetric matrix, its eigenvectors are guaranteed to be orthogonal to one another. For example, if the first two eigenvectors (each rescaled here so that its last entry equals 1) are multiplied together element by element and the products are summed, the result should be precisely zero:

$$3.59 \times (-4.04) + 0.063 \times 0.028 + 2.74 \times 1.26 + 2.94 \times 3.42 + 1 \times 1 = 0.005 \approx 0$$

and the result would be precise if not for round-off error.
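The same orthogonality check can be made directly on the unit-length eigenvectors. This short sketch assumes NumPy and copies the first two columns of the eigenvector matrix printed above; because those entries are rounded to two or three decimals, the dot product is small but not exactly zero.

```python
import numpy as np

# First two eigenvectors, copied from the columns of the printed matrix
v1 = np.array([0.655, 0.011, 0.500, 0.536, 0.183])
v2 = np.array([-0.73, 0.005, 0.228, 0.618, 0.181])

dot_12 = float(v1 @ v2)             # near zero if the vectors are orthogonal
len_1 = float(np.linalg.norm(v1))   # near one if the vector is unit length
len_2 = float(np.linalg.norm(v2))

print(f"v1 . v2 = {dot_12:.6f}")
print(f"|v1| = {len_1:.4f}, |v2| = {len_2:.4f}")
```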

Working toward the goal of plotting samples 1 through 7, a first step involves multiplying the eigenvector matrix by the eigenvalue matrix:


The next step involves multiplying this resultant matrix by the original data matrix. Because the fifth column of this matrix represents values of practically zero, only the first four columns are used to obtain four factors:




Each column headed by the word factor is one of the new variables within the sub-space of the original data matrix. The numbers to the left are the sample numbers used in the original data table. The factors represent an orthogonal coordinate system that enables plotting these seven samples to determine their relationship to one another.

Relative significance of each factor with respect to the total data information is determined by summing the eigenvalues, then dividing each eigenvalue by this sum to obtain a proportion. The five eigenvalues sum to 131.9. Factor 1, for instance, represents 100 × (91.8/131.9) ≈ 70% of the original data information content. The second factor, associated with an eigenvalue equal to 25, incorporates 100 × (25/131.9) ≈ 19% of the original data information content. If the seven samples are plotted using the first two factors, then the resultant plot represents roughly 89% of the original data information content. This plot is shown in Fig. 41.1.
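The proportions are simple enough to verify by hand, but a few lines of code make the cumulative totals explicit. This is a minimal sketch, assuming NumPy, that uses the five eigenvalues printed earlier in this section.

```python
import numpy as np

# Eigenvalues as printed along the diagonal of the eigenvalue matrix
eigenvalues = np.array([91.8, 25.0, 10.1, 4.8, 0.2])

proportions = 100.0 * eigenvalues / eigenvalues.sum()   # percent per factor
cumulative = np.cumsum(proportions)                     # running total

for i, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Factor {i}: {p:5.1f}%  (cumulative {c:5.1f}%)")
```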

Notice that samples 2, 4, and 5 plot in the negative region with respect to Factor 2. These three samples are associated with the highest gold values. But, these samples are among the lowest for silver, lead, and zinc. Samples 1 and 6 are much higher in lead and zinc, but much lower for gold. Each factor is a function of all five of the data variables, Au, Ag, Cu, Pb, and Zn. For example, in the foregoing matrix multiplication involving the original data matrix, the Factor 1 "coordinate" for sample 1 is equal to:

$$15 \times 60.13 + 0.4 \times 1.01 + 21 \times 45.9 + 15 \times 49.2 + 15 \times 16.8 = 3151.0...$$

In reviewing this calculation, notice that it is:

$$\text{Au value} \times 60.13 + \text{Ag value} \times 1.01 + \text{Cu value} \times 45.9 + \text{Pb value} \times 49.2 + \text{Zn value} \times 16.8.$$

Literally, the coordinate of a sample in any of the factors is a function of all the original variables, not just any one, or two. Because of this, the way samples plot in Fig. 41.1 reflects their similarity over all variables. A grouping of samples, then, suggests a rock-chemistry similarity that likely has importance to the interpretation of ore genesis.

**Fig. 41.1** A plot of the seven data samples with respect to factor 1 (horizontal axis) and factor 2 (vertical axis). Sample numbers are shown near each plotting symbol
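The Factor 1 multipliers used above are the entries of the first eigenvector scaled by the first eigenvalue. The sketch below, assuming NumPy, rebuilds those multipliers from the printed values and applies them to the four table rows that are legible here (samples 4 through 7); the dictionary and variable names are illustrative, not part of any published software.

```python
import numpy as np

# First eigenvector and first eigenvalue, as printed earlier in this section
v1 = np.array([0.655, 0.011, 0.500, 0.536, 0.183])
eigenvalue_1 = 91.8

# Multipliers for Au, Ag, Cu, Pb, Zn: about [60.13, 1.01, 45.9, 49.2, 16.8]
weights = eigenvalue_1 * v1

# Samples 4-7 from the data table (Au, Ag, Cu, Pb, Zn in ppm)
samples = {
    4: [37.0, 0.5, 24.0, 17.0, 7.0],
    5: [33.0, 0.3, 14.0, 13.0, 5.0],
    6: [12.0, 0.4, 21.0, 29.0, 5.0],
    7: [12.0, 0.4, 13.0, 19.0, 5.0],
}

# Each Factor 1 coordinate is a weighted sum over all five variables
factor1 = {k: float(np.dot(row, weights)) for k, row in samples.items()}

for k, value in factor1.items():
    print(f"sample {k}: factor 1 coordinate = {value:.1f}")
```

Every variable contributes to every coordinate, which is the point made in the text: no single element determines where a sample plots.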

To obtain the R-mode result, the original data matrix is once again squared, but in a different order of multiplication: [data][data]<sup>T</sup> = [R-Mode]. Following the same steps above for the Q-mode result, resulting factors for plotting of the five variables are:


The relative importance of each factor with respect to original data information content is the same as for the Q-mode result because the eigenvalues are identical. Figure 41.2 presents a plot based on the first two factors.

Figure 41.2 suggests that gold (Au) is not closely associated with any one of the four other variables. Focusing only on factor 1, gold (Au) and silver (Ag) are on opposite sides of the horizontal axis. Often, variables plotting as such are inversely related; when one is higher in value, the other is lower in value. Further with respect to factor 1, zinc (Zn) is closer to silver and lead (Pb) and copper (Cu) are closer to gold. If, however, the focus is solely on factor 2, then gold and lead appear to be inversely related.

Software is necessary for larger data sets. Using this example and challenging students to follow it for data sets other than the one used here will not necessarily guarantee a deep understanding. But, when reviewing the output from multivariate software, students will have a general understanding of what happens to the input data and of the jargon inherent to the method. Knowing why eigenvectors (factors) are used for developing plots gives students greater confidence when interpreting these plots.

**Fig. 41.2** A plot of the variables based on factor 1 (horizontal axis) and factor 2 (vertical axis). Variable labels are shown next to each plotting symbol

An actual multivariate data set consisting of about 1000 samples, each associated with 50 elemental variables, is analyzed using the multivariate method known as correspondence analysis (Benzecri 1973). This multivariate method is the one preferred by the author for actual data analysis, but its mathematical presentation is not as straightforward as that presented above for principal components analysis. It is the opinion of the author that correspondence analysis yields plots that separate the data better than other methods. The result is shown in Fig. 41.3 for variables only (to reduce the clutter of the plot).

How elements are related is interpreted from Fig. 41.3. Manganese (Mn) is polarizing, with all other elements plotting away from it along factor 1. Rocks higher in manganese are inferred to be much lower in the other elements. Knowing that the likely manganese mineral in this deposit is MnO (wad), and that this mineral is black and sooty, could be useful in the field for judging where ore is, or is not, present. Factor 2 separates barium (Ba) from the precious metals. The likely barium mineral is barite, an easily recognized mineral in the field if crystalline. This element, too, may be useful for the approximate delineation of the ore zone in the field based on visual inspection.

**Fig. 41.3** An R-mode plot of a multivariate geochemical set of data characterizing an ore deposit. The relative importance of each factor is indicated in the axis label. This plot was created by software that is presented in Carr (2002)

### **41.3 Geostatistics and Its Myriad Parameters**

Decisions. A teacher of geostatistical estimation can spend weeks teaching geostatistical theory, broadly so by including polygonal and inverse distance strategies in addition to kriging. Weeks! Then the teacher is faced with teaching the *practice* of estimation. The theory is complex, particularly in the case of kriging. And yet the outcome is highly vulnerable to the parameters selected for implementing theory. Figure 41.4, while not intended to be comprehensive, presents many of the decisions that must be made by a geostatistician when practicing the gridding of data.

A teacher can spend more or less time on geostatistical theory, less for undergraduates perhaps. Time, however, must still be devoted to explaining and advising on the parameters that are necessary to estimation. Moreover, the influence of one or more of these parameters is best appreciated by visualizing estimation outcomes.

**Fig. 41.4** Gridding a set of spatial data requires selecting the estimation algorithm for gridding, then defining parameters unique to the estimation algorithm. How to treat the data, raw or transformed, is another important decision. Likewise, the geometry of the grid must be designed

Estimation outcomes are visualized as color contour maps in the following demonstration. A set of 2,500 mercury values was collected to assess the severity of site contamination after a flood. The variogram for these data is shown in Fig. 41.5.

The shape of the variogram in Fig. 41.5 is modeled using a spherical variogram model. This variogram shape is the most commonly observed for spatial data, regardless of the spatial phenomenon under study. The model is explicitly defined by setting three parameters: the nugget (found by extrapolating the calculated variogram backwards to intersect the y-axis at h = 0); the sill, that value of the variogram that is more or less constant once the range is reached; and the range (of spatial correlation) itself. In this example, these parameters are: nugget = 20 (rounded), sill = 117 (rounded), and range = 90 m. Other parameters used in the following data visualization demonstration are as follows: (1) no data transform; (2) block support; (3) ordinary kriging; (4) general, isotropic search strategy with a radius equal to one-half the variogram range; (5) up to N = 10 nearest neighboring samples used for estimation; (6) inverse distance (power term = 1) and inverse distance squared (power term = 2) weighting presented for comparison to the kriging outcome; (7) grid parameters: 50 rows, each with 50 columns, with a spacing between rows and columns of 10 m. Outcomes are presented in Figs. 41.6 and 41.7.
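For readers who want to see the model as a formula, the spherical variogram can be written as a short function. This is a sketch under one stated assumption: the quoted sill of 117 is treated as the total sill (nugget included), which the text does not specify. The function name and argument names are illustrative.

```python
import numpy as np

def spherical_variogram(h, nugget=20.0, sill=117.0, a=90.0):
    """Spherical model: rises from the nugget to the sill at lag distance a.

    Assumption: `sill` is the total sill, so the structured part is sill - nugget.
    """
    h = np.asarray(h, dtype=float)
    r = np.clip(h / a, 0.0, 1.0)              # lags beyond the range stay at the sill
    gamma = nugget + (sill - nugget) * (1.5 * r - 0.5 * r**3)
    return np.where(h > 0.0, gamma, 0.0)      # the variogram is zero at h = 0

lags = np.array([0.0, 10.0, 45.0, 90.0, 200.0])
for h, g in zip(lags, spherical_variogram(lags)):
    print(f"h = {h:6.1f} m   gamma = {g:7.2f}")
```

The jump from zero at h = 0 to the nugget value at the smallest positive lag is what represents the nugget effect in this model.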

A lower nugget value is seen to yield less smoothing during estimation (Fig. 41.6). A larger nugget yields more smoothing. With inverse distance methods, the larger the power term, the less smoothing results during estimation (Fig. 41.7). The aesthetic appeal of a map is a subjective assessment. The amount of smoothing controls the complexity of the map. If larger scale aspects of a spatial region are of more interest than smaller scale aspects, then more smoothing should be used during estimation to downplay the smaller scales. On the other hand, if the desire is to visualize spatial variability down to the smallest possible scale that is allowed by the data, little or no smoothing should be used during estimation.
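The effect of the nugget and of the power term can be seen in the estimation weights themselves. The sketch below is an illustration only: it uses point support rather than the block support used for the maps, four hypothetical sample locations within 45 m of the target, a spherical model whose sill is assumed to be the total sill, and function names invented for this example.

```python
import numpy as np

def spherical(h, nugget, sill, a):
    """Spherical variogram; `sill` is treated as the total sill (an assumption)."""
    h = np.asarray(h, dtype=float)
    r = np.clip(h / a, 0.0, 1.0)
    gamma = nugget + (sill - nugget) * (1.5 * r - 0.5 * r**3)
    return np.where(h > 0.0, gamma, 0.0)

def ok_weights(coords, target, nugget, sill=117.0, a=90.0):
    """Ordinary kriging weights for a single target point (point support)."""
    n = len(coords)
    pair_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    lhs = np.ones((n + 1, n + 1))          # last row and column enforce sum(w) = 1
    lhs[:n, :n] = spherical(pair_dist, nugget, sill, a)
    lhs[n, n] = 0.0
    rhs = np.ones(n + 1)
    rhs[:n] = spherical(np.linalg.norm(coords - target, axis=1), nugget, sill, a)
    return np.linalg.solve(lhs, rhs)[:n]   # drop the Lagrange multiplier

def idw_weights(coords, target, power):
    """Inverse distance weights, normalized to sum to one."""
    d = np.linalg.norm(coords - target, axis=1)
    w = 1.0 / d**power
    return w / w.sum()

# Hypothetical sample locations (m); the first is the nearest to the target
coords = np.array([[5.0, 0.0], [0.0, 15.0], [-25.0, -10.0], [30.0, 20.0]])
target = np.array([0.0, 0.0])

w_zero = ok_weights(coords, target, nugget=0.0)      # no nugget
w_fitted = ok_weights(coords, target, nugget=20.0)   # nugget from Fig. 41.5
w_pure = ok_weights(coords, target, nugget=117.0)    # nugget equal to the sill
w_idw1 = idw_weights(coords, target, power=1)
w_idw2 = idw_weights(coords, target, power=2)

for name, w in [("nugget 0", w_zero), ("nugget 20", w_fitted),
                ("nugget = sill", w_pure), ("IDW p=1", w_idw1), ("IDW p=2", w_idw2)]:
    print(f"{name:14s}", np.round(w, 3))
```

When the nugget equals the sill, every neighbor receives the same weight, so the estimate is a simple average and smoothing is at its maximum. As the nugget decreases, or the inverse distance power increases, weight concentrates on the nearest sample and the estimate follows local values more closely.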

**Fig. 41.5** Variogram for 2,500 mercury values. The jagged line is the actual calculation outcome. The smooth, continuous line is a model fit to the calculation outcome. The model, in this case, is the spherical variogram model and its parameters, nugget, sill, and range, are listed above the variogram

**Fig. 41.6** Three visualizations from kriging. Top, left map is based on the variogram parameters, nugget, sill, and range, that are listed in Fig. 41.5. Top, right map is based on the same variogram parameters, except the nugget value is set equal to zero. The bottom, left map is based on the nugget value set equal to the sill value; in this case, the outcome is a simple average estimation. Integer labels are used for the contour lines to indicate relative value from smaller, 1, to larger, 10. Color also indicates relative value from lower, blue, to higher, red

Indeed, there are similarities among these maps. Each map shows a zone of higher mercury values in the center, and two low zones at the left-center and top-center. These regions are associated with a higher density of spatial samples. Regions of the map that change appreciably when estimation parameters are changed are more sparsely sampled. The spatial distribution of mercury samples is shown in Fig. 41.8.

Differences among the contour map outcomes are noteworthy for spatial locations associated with sparser sampling. Moreover, these differences are more easily observed when increased smoothing is used during estimation.

**Fig. 41.7** Outcomes from inverse distance squared weighting (left map) and inverse distance weighting (right map). The higher the power term, the less the smoothing. This outcome is similar to decreasing the nugget value in kriging

**Fig. 41.8** A map of the spatial locations of 2,500 mercury samples within a 250,000 m² region

### **41.4 The Variogram as a Stand-Alone Data Analytical Tool**

Kriging is not necessarily the ultimate goal of geostatistical analysis. The variogram as a stand-alone data analytical tool has a variety of uses that are independent of estimation. Examples are many and include noise isolation, texture classification of digital images, and self-affine fractal analysis and modeling. The concept of digital image texture is chosen for demonstration.

Four textures are illustrated in Fig. 41.9: water, playa, alluvium, and sedimentary rock outcrops.

**Fig. 41.9** The center mosaic shows four textures extracted from a Landsat TM image, clockwise from top, left: alluvium, playa, water, and sedimentary rock outcrops. Variograms for these textures are likewise arranged

Water and playa textures are similar, differing only in reflectivity. Variograms of these textures are likewise similar and indicate a predominant spatial randomness with little underlying signal. The variogram of alluvium texture indicates an underlying spatial structure that is heavily masked by noise (randomness). Unlike these other textures, sedimentary rock layer texture is predominantly signal. The variogram of this texture reveals the strong spatial structure and very low noise.
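The contrast between a noise-dominated and a signal-dominated texture can be reproduced with synthetic images. The sketch below assumes NumPy and uses two hypothetical 100 × 100 arrays in place of the Landsat extracts: pure random noise, and a layered pattern with a small amount of noise added. The variogram is computed along image rows only, which is a simplification.

```python
import numpy as np

def row_variogram(image, max_lag):
    """Experimental semivariogram along image rows for lags 1..max_lag (pixels)."""
    image = np.asarray(image, dtype=float)
    gammas = []
    for lag in range(1, max_lag + 1):
        diffs = image[:, lag:] - image[:, :-lag]   # all pixel pairs at this lag
        gammas.append(0.5 * np.mean(diffs**2))
    return np.array(gammas)

rng = np.random.default_rng(0)

# Hypothetical textures standing in for the image extracts
noise = rng.normal(size=(100, 100))                          # randomness only
cols = np.arange(100)
layered = np.tile(np.sin(2.0 * np.pi * cols / 40.0), (100, 1))
layered = layered + 0.05 * rng.normal(size=(100, 100))       # strong signal, little noise

g_noise = row_variogram(noise, max_lag=10)
g_layered = row_variogram(layered, max_lag=10)

print("noise:  ", np.round(g_noise, 2))
print("layered:", np.round(g_layered, 3))
```

The variogram of the random image is essentially flat from the first lag onward, which is the signature of a pure nugget effect, whereas the variogram of the layered image starts near zero and climbs with lag distance.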

Digital image classification is a process that depends on automatic identification of classes, features on the ground, based on some form of signature, or characteristic for each class. The histogram of pixel values is one such signature that is often used when basing classification on pixel value. The variogram is a signature that is useful for classifying the texture of ground classes. The foregoing demonstration shows that variograms do differ for ground classes, but in ways that are not directly relatable to pixel values. Playa and water, for instance, are distinctly different in brightness, yet their variograms are similar in shape. The variogram has been used with considerable success for the classification of microwave images (e.g., Carr and Miranda 1998; Miranda et al. 1998). These images are inherently noisy due to microwave frequency additions and cancelations that impart what is known as speckle. The classification of texture using variogram signatures applied to less noisy images, such as those from the Landsat satellite, has not been extensively tested.

In the foregoing example, the images of alluvium, playa, water, and sedimentary rock are 100 × 100 pixel extracts from a band 3 (visible red) Landsat 7 image of southern Nevada, U.S.A. (Fig. 41.10). This image was selected for its varied textures.

**Fig. 41.10** A band 3 (visible red) extract from a complete Landsat 7 scene, Path 39, Row 35, acquired on September 25, 2000

The variogram signatures shown in Fig. 41.9 were applied to this image for the classification of texture. The understanding of what constitutes texture in a digital image takes some time to develop. Texture is not brightness, per se, but rather the unique patterns exhibited by groups of pixels. The outcome of textural classification is shown in Fig. 41.11.

The predominant texture seen in Fig. 41.11 is that of alluvium. The texture of water is not unique and is confused with the texture of alluvium. The texture of the shoreline of the lake is identified as playa. This lake (Lake Mead, Clark County, Nevada, U.S.A.) is an artificial reservoir that has a fluctuating water level that leaves an almost pure white calcium carbonate staining on the shoreline. Like playa sediments, the reflectivity of this material often saturates the satellite sensor resulting in identical textures. Sedimentary outcrop texture was often confused with alluvium, and shadows (northwest facing slopes) were often confused with water. Given that this image is of a harsh, arid environment (precipitation is less than 7 cm per year), the predominant alluvium texture makes sense.

**Fig. 41.11** Outcome of texture classification based on variograms applied to the Landsat image shown in Fig. 41.10. Colors represent: water (red), playa (green), alluvium (blue), and sedimentary outcrops (yellow)


### **References**

Albarède F (1995) Introduction to geochemical modeling. Cambridge University Press



### **Chapter 42 Linear Unmixing in the Geologic Sciences: More Than A Half-Century of Progress**

**William E. Full**

**Abstract** For more than a half-century, scientists have been developing a tool for linear unmixing utilizing collections of algorithms and computer programs that is appropriate for many types of data commonly encountered in the geologic and other science disciplines. Applications include the analysis of particle size data, Fourier shape coefficients and related spectra, biologic morphology and fossil assemblage information, environmental data, petrographic image analysis, the unmixing of igneous and metamorphic petrographic variables, and the unmixing and determination of oil sources, to name a few. Each of these studies used algorithms that were designed to use data whose row sums are constant. Non-constant sum data comprise a larger set of data that permeates many of our sciences. Many times, these data can be modeled as mixtures even though the row sums do not sum to the same value for all samples in the data. This occurs when different quantities of one or more end-members are present in the data. Use of the constant sum approach for these data can produce confusing and inaccurate results, especially when the end-members need to be defined away from the data cloud. The approach to deal with these non-constant sum data is defined and called Hyperplanar Vector Analysis (HVA). Without abandoning over 50 years of experience, HVA merges the concepts developed over this time and extends the linear unmixing approach to more types of data. The basis for this development involves a translation and rotation of the raw data that conserves information (variability). It will also be shown that HVA is a more appropriate name for both the previous constant sum algorithms and future algorithms and programs as well.

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_42

W. E. Full (✉)

GXStat, LLC, 1321 Farmstead, Wichita, KS 67208, USA e-mail: bill@GXStat.com; BillFull@cox.net

### **42.1 Introduction**

Unmixing algorithms and programs have been used to solve many different types of geologic problems for more than 50 years. This approach has been developed by geologists for geologists and has been recently 'borrowed' by professionals in other fields. For the most part, the International Association for Mathematical Geosciences' publications Journal of Mathematical Geology (later renamed Journal of Mathematical Geosciences) and Computers & Geosciences have been the venue for the papers describing the developments and computer codes associated with the approaches described in this report. The history of linear unmixing tied to these papers is the topic of this manuscript along with extending the mathematics to make this approach more appropriate for more common types of geologic and petroleum data. The most recent name for these algorithms is Hyperplanar Vector Analysis (HVA)—a name that will be shown to be more appropriate than the other algorithms/program names that have been used in the past.

### **42.2 History of Constant Sum HVA**

### *42.2.1 Determination of the Number of End-Members*

The rudiments of HVA started with a report to the Office of Naval Research by Imbrie (1963). In this report, the application of the cosine-theta similarity matrix was defined for the Q-mode factor analysis portions of HVA that were to follow. The cosine is used as a similarity index between two samples (Fig. 42.1a). When the angle between two samples approaches 0.0 (cosine approaches 1.0), the ratios of the variables in the two samples are assumed to be nearly the same. Conversely, when a cosine approaches 0.0 (Θ = π/2 radians), the two samples are considered very different from each other. In statistics, a cosine value of 0.0 would mean that the two samples are considered independent of each other. While the Imbrie (1963) approach never calculated a cosine function, it did accomplish the same thing by working with the unit vectors of each sample and with the unit sphere defined by these vectors, which was subsequently rotated via an eigenvector rotation. The resulting matrix is the cosine-theta matrix defined for all the samples. Figure 42.1b shows the case where two vectors of differing length produce the same cosine Θ as two vectors of exactly the same length, so the cosine alone cannot distinguish them. The constant sum approach assumes that the raw data represent vectors of equal length.
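The cosine-theta matrix follows directly from the unit vectors of the samples. The sketch below, assuming NumPy and a small hypothetical data matrix, shows the two limiting cases described above: samples with identical variable ratios but different lengths have a cosine of 1.0, and samples with no variables in common have a cosine of 0.0.

```python
import numpy as np

def cosine_theta(data):
    """Cosine-theta similarity matrix between the rows (samples) of `data`."""
    data = np.asarray(data, dtype=float)
    unit = data / np.linalg.norm(data, axis=1, keepdims=True)  # project to unit sphere
    return unit @ unit.T

# Hypothetical samples (rows) described by three variables (columns)
samples = np.array([
    [2.0, 1.0, 1.0],
    [4.0, 2.0, 2.0],   # same ratios as the first row, twice the length
    [0.0, 3.0, 0.0],
    [5.0, 0.0, 0.0],   # shares no variable with the third row
])

C = cosine_theta(samples)
print(np.round(C, 3))
```

Because each row is reduced to a unit vector before the comparison, any information carried by the length of the sample vector is discarded, which is the limitation illustrated in Fig. 42.1b.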

**Fig. 42.1** Example of the cosine as a measure of similarity where two samples are very similar to each other in terms of the ratio of the defining variables (**a**), and where the two samples are more dissimilar than the previous two samples (**b**). With constant sum models, both sets of vectors would be considered as essentially the same

**Fig. 42.2** Every sample (row of data) can be considered a vector. The unit vector is the direction of this vector where the length of the unit vector is exactly 1.0 (**a**). The collections of the sample unit vectors are located on the unit sphere whose radius is 1.0 (**b**)

Working with vectors on the unit sphere is one of the fundamental differences between what we have been calling vector analysis and traditional factor analysis. Figure 42.2a illustrates the concept of a unit vector while Fig. 42.2b shows a cross-section of the unit sphere in two dimensions. In traditional factor analysis, in simplified terms, before the eigenvector rotation is performed, the mean of either the raw data or transformed data (usually the z-transform) is subtracted from the variance (or covariance matrix). This step in the procedure is a translation of the axes defining the system (Fig. 42.3). Figure 42.3 also shows in 2-dimensions that the use of the cosine-theta similarity approach does ultimately define eigenvectors and eigenvalues relative to the center of the unit sphere. It should be pointed out that using the approach of Imbrie (1963), the total variability (sum of squares of each coordinate in the space defined by the unit sphere) before and after the eigenvector rotation is simply the number of samples (N). If we have 45 samples, we will have variability in the unit sphere of 45.0. A FORTRAN-IV computer program to perform this procedure was published by Klovan and Imbrie (1971) and was named CABFAC (Columbia and Brown Factor Analysis). Unfortunately for a generation of students and practitioners, the terminology used in this and several of the subsequent programs was rooted in factor analysis.
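The statement that the total variability on the unit sphere equals the number of samples is easy to confirm numerically. The sketch below assumes NumPy and a hypothetical data set of 45 samples and 6 variables. For convenience it rotates with the eigenvectors of the small variables-by-variables cross-product of the unit vectors, which shares its nonzero eigenvalues with the 45 × 45 cosine-theta matrix; this is a shortcut for illustration, not a description of CABFAC.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 45 samples (rows) by 6 variables (columns)
data = rng.uniform(1.0, 10.0, size=(45, 6))

# Move every sample onto the unit sphere
unit = data / np.linalg.norm(data, axis=1, keepdims=True)
total_before = float(np.sum(unit**2))          # sum of squared coordinates

# Eigenvector rotation about the center of the unit sphere (no mean removed)
evals, evecs = np.linalg.eigh(unit.T @ unit)
rotated = unit @ evecs
total_after = float(np.sum(rotated**2))

print(f"before rotation: {total_before:.6f}")
print(f"after rotation:  {total_after:.6f}")
print(f"sum of eigenvalues: {evals.sum():.6f}")
```

All three totals equal 45, the number of samples, because each unit vector contributes a squared length of exactly 1.0 and an orthogonal rotation does not change lengths.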

The next step in the evolution of HVA was taken by Miesch (1976a, b). Miesch realized that the CABFAC program was really a combination of linear algebra and geometry. The eigenvector rotation defined by the previous authors was actually capturing the geometry of the data on the unit sphere. This fact, in conjunction with the observation that with constant sum data the raw samples must fall on either a line (2-D), plane (3-D) or hyperplane (n-D), was a fundamental concept for Miesch. This was a different viewpoint about constant sum data than that reported by Chayes (1971). Miesch concluded that CABFAC can be used to tell us the real dimensionality of the data (must be less than or equal to the number of variables) and that with some additional programming, the end-members and relationships between these end-members and each sample (proportions) can be defined. Programs were created and published by Klovan and Miesch (1976) called EXTENDED CABFAC and QMODEL. These two programs, while still using the standard terminology of factor analysis, represented the foundation of the vector analysis unmixing approach that is used to this day. As a matter of fact, rotation procedures such as the orthogonal VARIMAX rotation (Kaiser 1958) are still performed in the programs.

**Fig. 42.3** In traditional PCA or factor analysis, the subtraction of the mean is performed before the eigenvector rotation and is a translation of the axes to the center of the data (**a**). Of course, in a standard PCA or factor analysis, we would divide each value by the standard deviation of the corresponding variable. In contrast, the Q-Mode analysis defined by Imbrie (1963) defines the center of the unit sphere as the point of reference for the eigenvalue rotation (**b**)

Before we continue with the QMODEL evolution, a discussion is in order of the ways that EXTENDED CABFAC helps us determine the appropriate number of dimensions to choose, which is, in reality, the number of end-members present in the data. CABFAC presents us with several ways of defining the exact number or range of end-members that may be present in the data. Note that CABFAC does not tell us anything about what they look like, or about the proportions relating these end-members to each sample. For the sake of this discussion, a data set was created wherein four end-members were mixed in known proportions. While the end-members were not constant sum (the sum of each end-member was not the same value), the collection of these data can still be informative, especially when we discuss non-constant sum analysis. The four end-members were taken from NURE stream sediment geochemical samples (Smith 1997) and this data set. For this section on constant sum algorithms, each sample in the data was transformed to a constant row sum of 1.0 before being submitted to CABFAC/SAWVEC/VECTOR/PVA routines.

The traditional approach used in the past is the scree plot (Fig. 42.4a). In this plot, the user looks for a break in the slope and then interprets this point as the maximum number of end-members present in the data. Note that like real data, Fig. 42.4a shows a case where the scree plot need not behave in an ideal sense. Miesch (1976a, b) recognized that since we are looking at how well the constant sum plane or hyperplane 'fits' the original data, back-calculated values from a **reduced** space defined by fewer than n eigenvectors can be directly compared to the variables defined in the raw data or **real** space. This back-calculation simply reverses the mathematics using a reduced number of eigenvectors 'back' into the raw data metric via matrix algebra. The comparison is made via the coefficient of determination (CD) function (Draper and Smith 2014) and the CD for each back-calculated variable to the original raw data for a given number of retained eigenvectors is plotted (Fig. 42.4b). Similarly, for each sample, total amount of original variability retained for a given number of eigenvectors is also calculated. This ratio is called the communality for a given sample and is the amount of variability retained by the reduced space divided by the total variability represented by that sample in real space. Figure 42.4c presents a few communality trends for arbitrary samples picked from the test data set. The collection of communalities for a given number of retained eigenvectors can be scanned to look for anomalies that may represent problematic data or the collection can be binned and plotted to assess the range of problems. 
In the past, a general 'rule of thumb' was to scan the columns of orthogonal coordinates (loadings) from the fewest to the highest number of end-members: the first time that approximately 5% or less of the data had communalities less than 0.99 and the coordinates had values less than 0.5, that number of end-members was near the upper range for the maximum number of end-members. The reality was that lower communalities might be due to noise, measurement error or recording error, or they might hint at one or more additional end-members, which would generally be more difficult for the modeling programs to define. Johnson (1997a, b) recognized that plots of the back-calculated variables against the raw variables yield further insight, especially for those who want to visualize the 'pile' of numbers described earlier. Figure 42.4d displays some of those plots for a single variable. These plots have been called Johnson plots in the programs described later in this report.
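The diagnostics described above (eigenvalues for the scree plot, CDs per variable, communalities per sample) can be sketched numerically. The following is a minimal illustration only, assuming that a singular value decomposition of the row-normalized data can stand in for the EXTENDED CABFAC eigen-decomposition. The function name `fit_diagnostics`, the synthetic three end-member data and the use of a squared correlation as the CD are assumptions made for this sketch and are not taken from the original programs.

```python
import numpy as np

def fit_diagnostics(X, k):
    """Return (cd, communality) for k retained eigenvectors.

    X is an (n_samples, n_variables) array of constant sum data.
    Illustrative sketch only; this is not the EXTENDED CABFAC code.
    """
    # Scale each sample (row) to unit length, as in a cosine-theta Q-mode analysis.
    lengths = np.linalg.norm(X, axis=1, keepdims=True)
    W = X / lengths
    # The squared singular values play the role of the eigenvalues (scree plot).
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    loadings = U[:, :k] * s[:k]
    # Communality: variability retained in the reduced space / total variability.
    communality = np.sum(loadings**2, axis=1) / np.sum(W**2, axis=1)
    # Back-calculate the variables from the reduced space into the raw metric.
    X_hat = (loadings @ Vt[:k]) * lengths
    # CD per variable, taken here as the squared correlation (an assumption).
    cd = np.array([np.corrcoef(X[:, j], X_hat[:, j])[0, 1] ** 2
                   for j in range(X.shape[1])])
    return cd, communality

# Synthetic constant sum mixtures of three hypothetical end-members.
rng = np.random.default_rng(0)
end_members = rng.uniform(1.0, 10.0, size=(3, 6))
end_members /= end_members.sum(axis=1, keepdims=True)
props = rng.dirichlet(np.ones(3), size=50)
X = props @ end_members

for k in (1, 2, 3, 4):
    cd, comm = fit_diagnostics(X, k)
    print(k, cd.min().round(4), comm.min().round(6))
```

With noise-free data such as these, both measures reach 1.0 once the number of retained eigenvectors equals the number of end-members; with real data the approach to 1.0 is gradual, which is why the rule of thumb above is needed.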

**Fig. 42.4** An example of the scree plot from the test data where the number of eigenvectors retained is plotted against individual eigenvalues (**a**). A plot of the CDs for the test data shows how each variable contributes to the overall choice of the number of end-members (**b**). Communalities for four samples are presented for the range of eigenvectors retained (**c**). Collection of Johnson plots showing the visual fit relative to a single variable as the number of end-members (EM) is increased (**d**)

Finally, if the assumption is that what is not included is in fact noise, there might not be enough information available to define any additional end-members. In such a case, the distribution of the variability relative to each 'removed' eigenvector can be examined. This is usually done by looking at the 'coordinates' of the removed eigenvectors (similar to looking at the principal component loadings in Principal Components Analysis) and using external tools such as JMP Pro (1989–2017). The latest programs create appropriate data tables, with key information, for this step and for all of the previous steps; these tables can be used in ancillary programs that have many more statistical functions and better graphics. One such example might be to examine the behavior of the 'removed' eigenvector coordinates to verify that the 'removed' eigenvectors do not contain meaningful information (i.e. whether they can be considered noise and not pertinent to the overall model). The user would have to establish a priori the criteria that define noise, in terms of the individual data used and/or by some distribution parameters such as the mean and standard deviation.

### *42.2.2 Determination of the Composition of the End-Members and Proportions*

Klovan and Miesch (1976) developed the program QMODEL based on Miesch (1976a) in order to define the composition of the end-members and calculate the proportions relating each individual sample to this set of end-members. Given the choice of the number of end-members normally based on EXTENDED CABFAC, the procedure to define the compositions and proportions (oblique coordinates of the space defined by the end-member axes) is strictly linear algebra. The mathematics used up to this point is well defined in Miesch (1976a). QMODEL was designed to be a data modeling program that required interaction with the user. A discussion of these approaches and other alternatives can be found in Clarke (1978). There were several ways for this program to define end-members:


For each of the choices in the original QMODEL program, correct choices produced end-members that were realistic (defined by acceptable variables in the raw data space) and proportions that were between 0.0 and 1.0. Problems arose with many data sets when the raw end-member compositions were unrealistic and/or the proportions were out of range. This problem is commonly encountered when there are many variables and samples, which makes visualization of the location of the potential end-members difficult at best. To that end, new modeling approaches were devised that gave some automation toward the definition of proper end-members and proportions.
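Once a set of end-member compositions has been chosen, obtaining the proportions is a linear algebra problem, and the viability check described above is a simple range test. A minimal sketch follows, assuming the end-members are known and using an ordinary least-squares solve; the function names and the small three-variable example are hypothetical and are not the QMODEL code.

```python
import numpy as np

def proportions(samples, end_members):
    """Solve samples = P @ end_members for P by least squares."""
    # lstsq solves end_members.T @ P.T = samples.T
    P_t, *_ = np.linalg.lstsq(end_members.T, samples.T, rcond=None)
    return P_t.T

def viable(P, tol=1e-9):
    """True for each sample whose proportions all lie between 0.0 and 1.0."""
    return np.all((P >= -tol) & (P <= 1.0 + tol), axis=1)

# Three hypothetical constant sum end-members (each row sums to 1.0).
end_members = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.2, 0.2, 0.6]])

inside = np.array([0.5, 0.3, 0.2]) @ end_members   # a true mixture
outside = np.array([0.90, 0.05, 0.05])             # lies outside the triangle
P = proportions(np.vstack([inside, outside]), end_members)

print(P)          # second row contains negative proportions
print(viable(P))  # first sample is viable, second is not
```

The second sample still yields proportions that sum to 1.0, but two of them are negative. This is the kind of out-of-range result that signals the chosen end-members do not enclose the data.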

Full et al. (1981, 1982) devised two alternative methods that involve an iterative scheme that starts with one of the original QMODEL choices above or with fuzzy cluster centers (Bezdek et al. 1984), and then allows the program to define end-members external to the data, check their proportions for viability, move the set of end-member compositions to the nearest viable location if needed, and repeat the process until either an acceptable solution is reached or the program fails to converge. The goal was to determine appropriate sets of end-members closest to the data cloud defined by the samples. This may be likened to trying to minimize the area or hyper-area that represents the planar/hyperplanar convex hull defined by the end-members. The computer code, along with some bug fixes to the EXRAWC and EXNORC subroutines, can be found in the appendix of Full (1981). A general discussion of these methods and their applications at the time can be found in Ehrlich and Full (1988). Alternatives to the aforementioned approaches can be found in Leinen and Pisias (1984) and Weltje (1997). Insights into the appropriate applications of these algorithms and into how to detect problems with the underlying model were discussed in Williams et al. (1988a, b, Chaps. 15 and 19). Optimized data binning for continuous distributions, which improved the results of these algorithms, was presented in Full et al. (1984).

### *42.2.3 The Renaming to Polytopic Vector Analysis*

In the early 1980s, given the changes to the original CABFAC and QMODEL programs, the approach was renamed SAWVEC (South Carolina and Wichita Vector Analysis) and sometimes simply VECTOR. It was the recognition that the algorithms were dominated by vector algebra that prompted the name change. Circa 1990, the exact same approach was further renamed Polytopic Vector Analysis and applied under that name in Evans et al. (1992) and in many of the references mentioned later in this report. Around this time, Sterling James Crabtree, then at the University of South Carolina, translated the FORTRAN IV code of Full (1981) into the C programming language, developed a Windows interface and ultimately called the program PVA. This program can be recognized by the fact that the first step after starting the program was to resize the introductory window.

The use of the term polytope has been problematic for this author even though the term was used in the original Full et al. (1981) algorithm. The field of polytopic mathematics has been around for over a century and was generally formalized by Coxeter (1948, 1973). Coxeter assumed that a polytope was a geometric construct in 4 or more dimensions, with the degenerate cases being the point in 0 dimensions, the line segment in 1 dimension, the polygon in 2 dimensions and the polyhedron in 3 dimensions, representing polytopes of dimension 0, 1, 2 and 3 respectively. A search of the literature on polytopes shows that this field of mathematics is rich in various definitions of a polytope, depending for instance on whether you are talking about a convex hull in n dimensions or more complex surfaces as in star-type polytopes. It is clear that for the geologist this can be a confusing landscape to travel through. A simplistic definition would be that a polytope is an n-dimensional geometric figure (n > 3) whose sides are planes or hyperplanes. The implicit assumption is that a polytope has some kind of volume or hypervolume. Henk et al. (1997) even developed equations for calculating this volume or hypervolume for many types of regular polytopes.

If a polytope can be considered as a region of n-dimensional space that is enclosed by hyperplanes (Coxeter 1973), then that causes problems for linear unmixing. If we consider a vector emanating from a point outside that region and look at the potential intersections of that vector with the polytope, the only possibilities for unique points would be if the vector intersected the vertices of the polytope. If the vector intersected a side, there could possibly be two or more points of intersection, which would cause havoc with the uniqueness aspects of the unmixing model. The reality is that in the non-constant sum model, regardless of the number of dimensions (end-members), the data fall on a hyperplane when the number of dimensions is greater than 3. As we will see later, it is through this fact that the extension of all of the previous algorithms to non-constant sum data can be realized. Because of the confusion associated with the term 'polytope' relative to the understanding of the previously described algorithms, they have been renamed Hyperplanar Vector Analysis (HVA).

### *42.2.4 Review of the Applications of Constant Sum Unmixing*

The CABFAC, EXTENDED CABFAC-EXTENDED QMODEL, SAWVEC, VECTOR, PVA algorithms and programs (henceforth referred to as HVA family of algorithms) have found application in many geologic disciplines. Some of the earliest studies have involved the analysis of size data in both nearshore and lacustrine environments. These include the work of Klovan (1966) and Solohub and Klovan (1970) using traditional sieved size data. Fillon and Full (1984) used specialized equipment to define the size of particles on an individual basis and defined 5 different sources of deep sea sediment. As pointed out in Fillon and Full (1984) and Full et al. (1984), the success or failure of size analysis depends on the optimization of the size data using transforms such as the maximum entropy method.

In the field of grain shape analysis, the heart of the analytic scheme was the constant sum unmixing algorithms described above. The studies included sediment from Monterey Bay, CA (Porter et al. 1979). Brown et al. (1980), Reister et al. (1982), Mazzullo et al. (1982, 1984), Hudson and Ehrlich (1980), Smith et al. (1985), Tortora et al. (1986) and Evangelista et al. (1986, 1994, 1996) looked at sediment distributions along beaches, barrier islands, shelf and abyssal plains. Murillo-Jiménez et al. (2007) examined the sediment from a relatively large region along the southern coast of Baja California, MX. More lithified material was studied by Mazzullo and Ehrlich (1980, 1983) and Civitelli et al. (1992). El-Awawdeh and Full (1996) looked at changes in key morphology in Florida Bay over time. The methods used in those studies were reviewed in Ehrlich and Full (1984a, b) and Zhao et al. (2004).

The biologic morphology and fossil assemblage scientists were early adopters of the HVA family of algorithms. Healy-Williams (1983, 1984) and Healy-Williams et al. (1997) worked with forams, Burke et al. (1986) with ostracodes and Kensington and Full (1994) with scallops. Williams et al. (1988a, b) looked at correlations of foram shapes with isotopic signatures. Assemblages of microfossils were unmixed in Gary et al. (2005) and Zellers and Gary (2007).

A major area of investigation using the HVA family of algorithms deals with environmental science. Detecting contaminants in soils and identifying their sources was reported by Ehrlich et al. (1994), Wenning and Erickson (1994), Doré et al. (1996), Jarman et al. (1997), Johnson (1997a, b), Huntley et al. (1998), Bright et al. (1999), Johnson et al. (2000, 2001), Johnson and Quensen (2000), Nash and Johnson (2002), Nash et al. (2004), Barabas et al. (2004a, b), Magar et al. (2005), DeCaprio et al. (2005), Towey et al. (2012), Leather et al. (2012) and Megson et al. (2014). The Battelle Memorial Institute (2012) has listed PVA in their handbook for determining the sources of PCB in sediments.

The HVA family of algorithms is critical for the field of PIA (Petrographic Image Analysis). The literature includes Ehrlich and Horkowitz (1984), Ehrlich et al. (1984, 1991a, b, 1996, 1997), Ross et al. (1986), Scheffe and Full (1986), Full (1987), Etris et al. (1988), McCreesh et al. (1991), Ross and Ehrlich (1991), Ferm et al. (1993), Bowers et al. (1994, 1995), James (1995), Carr et al. (1996), Yannick et al. (1996), Anguy et al. (1999, 2002) and Sophie et al. (1999).

Igneous rock researchers have also been adopters of these unmixing algorithms. These include Horkowitz et al. (1989), Stattegger and Morton (1992), Tefend et al. (2007), Vogel et al. (2008), Deering et al. (2008), Barclay et al. (2010), Szymanski et al. (2013), Lisowiec et al. (2015) and, most recently, Blum-Oeste and Wörner (2016).

The unmixing of sources of oil using the HVA algorithms has been reported by Collister et al. (2004), Van de Wetering et al. (2015), Abrams et al. (2016) and Mudge (2016). The correlation between stratigraphy and chemical stratigraphic data was explored by McKenna et al. (1988). "Quasigeostrophic potential vorticity" was explored in Evans et al. (1992). Mason and Ehrlich (1995) looked at aspects of well logs for basin exploration. Full and James (2015) used the HVA (non-constant sum version) to decompose a large data set consisting of exploration data in order to better assess exploration and exploitation risk. At least two patents have mentioned using the HVA family of algorithms for analysis of the data derived from their process (Shafer and Ehrlich 1986; Nelson et al. 2013).

The above literature by no means represents the entire community of users of the unmixing approach begun by Imbrie (1963). There have been verbal reports of researchers doing work with Shakespeare's plays, classifying business reports, analyzing social data and even applying these approaches to marketing data. The success or failure of these studies cannot be directly ascertained, but they represent some interesting applications.

### **42.3 Non-constant Sum Data and Algorithms**

The previous sections, for the most part, dealt with rows of data whose row sum was the same or very similar for each sample (vector). This type of data is merely a subset of the data commonly encountered in the geologic sciences and, if you want to use the previous algorithms, you potentially have to degrade your data by transforming it to percentages or some other appropriate constant value. Oftentimes, this involves removing the absolute quantity involved with each sample. For example, if you have six glasses and pour into each glass a variable amount of three solutions, some glasses might contain a greater volume and some a lesser volume; here the quantity of each solution might be important. The concept of unmixing might still be appropriate but would only be accurately defined in terms of end-member compositions and sample proportions in very special cases that will be discussed below. With petrographic image analysis, which heavily uses the unmixing algorithms, two collections of imaged thin sections with vastly different porosities would ultimately have equal constant sum smooth-rough distributions (Fig. 42.5). Petrophysical logs, formation depths, seismic parameters and other petroleum related data are mostly non-constant sum in nature. There are many other types of data where the concept of mixtures and unmixing can be validly applied.

**Fig. 42.5** An example of two idealized images that would produce the same smooth-rough distributions in the petrographic image analysis system described in Ehrlich et al. (1991a, b). Note that in image **a**, the porosity would be much greater than in image **b**, which would greatly affect the calculation of permeability and other petrophysical variables

What happens when you try to apply the constant sum programs to inherently non-constant sum data? This topic was partially addressed by Klovan (1981) without addressing the application of determining end-members and proportions using the techniques described by Full et al. (1981, 1984). In his paper, Klovan notes that, if the data can be approximated by a plane or hyperplane parallel to the constant sum plane, then the aforementioned algorithms can be appropriately applied. However, Klovan (1981) acknowledges problems when the surface defined by the non-constant sum data is not parallel to the unit constant sum plane. Some of the problems can be demonstrated by a simple diagram in two dimensions (Fig. 42.6). Note that the midpoint of the non-constant sum segment does not project onto the midpoint of the constant sum line, so the proportions reported for this point by the computer codes would not be (0.5, 0.5). Using some of the usual functions to create constant sum data that are available in the program would not help matters. A more complex series of transformations using trigonometry could easily be developed for 2 or 3 dimensions but would be difficult to visualize and cannot be easily generalized to n dimensions. Also note that Fig. 42.6 represents an example in two dimensions in which the line intersects the two axes, making the determination of end-member compositions a bit easier; the end-members would be represented by the end-points of each line, and their compositions would be the raw data points defining these end-points. If end-members needed to be defined beyond the data cloud, the definition of the end-member compositions would be very difficult when there are more than 3 dimensions.

**Fig. 42.6** A simplistic example of some of the issues associated with using constant sum algorithms with non-constant sum data. The unit constant sum line is represented by the solid line passing through the points (1, 0) and (0, 1). The non-constant sum data are represented by the solid line at an oblique angle to the constant sum line. The mid-point (proportions 0.5, 0.5) of each line is marked with a symbol. Note that the extended unit vector (represented by the dashed line) that passes through the midpoint of the constant sum system diverges from the unit vector that passes through the mid-point of the non-constant sum line segment
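The midpoint mismatch described for Fig. 42.6 can be reproduced with a few lines of arithmetic. In the sketch below, the end-points (3, 0) and (0, 1) are hypothetical values chosen only so that the mixing line intersects both axes at an oblique angle to the constant sum line; they are not taken from the figure.

```python
import numpy as np

# Hypothetical end-points of a non-constant sum mixing line in two variables.
end_a = np.array([3.0, 0.0])
end_b = np.array([0.0, 1.0])

# A true 50:50 mixture of the two end-points.
true_props = np.array([0.5, 0.5])
midpoint = true_props[0] * end_a + true_props[1] * end_b   # (1.5, 0.5)

# Constant sum transform: divide by the row sum so the values add to 1.0.
closed = midpoint / midpoint.sum()                          # (0.75, 0.25)

# After the same transform the end-points become (1, 0) and (0, 1), so the
# closed coordinates are themselves the proportions that a constant sum
# program would report for this sample.
reported = closed

print("true proportions:    ", true_props)
print("reported proportions:", reported)
```

The sample that is an equal mixture in the raw metric is reported as a 75:25 mixture after the constant sum transform, because the end-member with the larger total dominates the closed composition.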

The non-constant sum problem was solved in the mid-1980s, and the solution has been used in petroleum industry projects and in research projects for the Department of Defense. The code was initially run on a 386 processor with a 387 co-processor as well as on IBM mainframes. It is only recently that the computer code has been written for the Windows operating system with a Windows GUI. The abstract concept behind the approach to dealing with this type of data is to recognize that ultimately any mixing problem deals with data on either a line segment (in 2-d), a plane (2 or 3-d) or a hyperplane in more than 3 dimensions. The goal then is to define that hyperplane and translate/rotate the data to a plane/hyperplane that is parallel to the unit constant sum plane, where we can apply the usual constant sum approaches. Afterward, any time we want to know what the raw compositions are, we reverse the translation/rotation to bring us back into the original metric. In this way, the earlier approaches are not abandoned but can be efficiently extended to almost any other data that can be modeled as a mixture.

The procedure for this translation/rotation is the following:

(1) Remove the mean from the data. This is equivalent to the first step of principal components (Davis 2002; Draper and Smith 2014). The visualization for this step is that the axes defining the raw data are translated to the mean of the data with no loss of information.


In more simplistic terms, what we have done is to create an NV x NV matrix (NV = the number of variables) that will be used to rotate the raw data in order to create a one-to-one correspondence with a set of points in a plane/hyperplane parallel to a constant sum plane/hyperplane. This matrix was orthogonalized and the application of this rotation and translation results in the loss of no information. Since this is an orthogonal matrix, the transpose of this matrix is the inverse of the matrix and gives us the function to go from the constant sum hyperplane to the raw data. These functions allow for properly defined proportions and end-member compositions whether the end-members are contained in the data or not. Figure 42.7 illustrates what the procedure is doing in general.
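The general idea of a lossless translation and orthogonal rotation onto a constant sum hyperplane can be illustrated as follows. This is a sketch under stated assumptions and is not the author's G\* construction: it assumes the data lie exactly on a hyperplane with a single normal direction, takes that normal from the eigenvector with the smallest eigenvalue, and rotates it onto the normal of the constant sum plane. The function names, the rotation formula used and the example end-members are all choices made for this illustration.

```python
import numpy as np

def rotation_between(a, b):
    """Proper rotation R with R @ a = b, for unit vectors a and b (a != -b)."""
    s = a + b
    return np.eye(a.size) - np.outer(s, s) / (1.0 + a @ b) + 2.0 * np.outer(b, a)

def to_constant_sum(X):
    """Translate and rotate hyperplanar data onto the plane where rows sum to 1."""
    nv = X.shape[1]
    mean = X.mean(axis=0)
    Xc = X - mean                                # step (1): remove the mean
    _, evecs = np.linalg.eigh(Xc.T @ Xc)         # eigenvalues in ascending order
    normal = evecs[:, 0]                         # direction with (near) zero spread
    unit = np.ones(nv) / np.sqrt(nv)             # normal of the constant sum plane
    if normal @ unit < 0:
        normal = -normal
    R = rotation_between(normal, unit)           # NV x NV orthogonal matrix
    shift = np.full(nv, 1.0 / nv)                # moves the plane to row sum 1.0
    return Xc @ R.T + shift, R, mean, shift

def to_raw(Y, R, mean, shift):
    """Reverse the mapping; the transpose of R is its inverse."""
    return (Y - shift) @ R + mean

# Hypothetical non-constant sum end-members and random mixtures of them.
rng = np.random.default_rng(0)
E = np.array([[10.0, 2.0, 1.0], [1.0, 6.0, 2.0], [2.0, 3.0, 20.0]])
props = rng.dirichlet(np.ones(3), size=40)
X = props @ E

Y, R, mean, shift = to_constant_sum(X)
E_cs = (E - mean) @ R.T + shift                  # end-members in the rotated space
P_t, *_ = np.linalg.lstsq(E_cs.T, Y.T, rcond=None)
P = P_t.T                                        # proportions found in that space

print(np.allclose(Y.sum(axis=1), 1.0), np.allclose(to_raw(Y, R, mean, shift), X))
```

Because the mapping is a translation followed by an orthogonal rotation, mixing proportions are preserved: the proportions solved for in the rotated space match those used to build the raw data, and the transpose of the rotation matrix returns any point to the original metric.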

**Fig. 42.7** A 2-dimensional representation of the procedure used to define the G\* matrix described in the text. Note that in 2 dimensions, the first eigenvector defines the direction of the line segment and the second the normal to this segment. The red axes represent the first eigenvector and the normal to the constant sum line. These axes are then translated to the mean of the non-constant sum data cloud defined by the green diamonds. The blue axes represent the first eigenvector and the normal to the non-constant sum line. This set of axes will be orthogonally rotated to the position of the constant sum axes (dotted axes), i.e., the raw data will be defined by a new set of coordinates. Mathematically, this procedure will not result in information loss

The constant sum routines can then be applied as they were before, only using the G\* and G\*T matrices defined above to move between the raw data hyperplane and the constant sum hyperplane with no loss of information (or minimal loss due to computational error). This approach capitalizes on more than a half-century of previous algorithmic and programming experience. Furthermore, the appropriateness of the unmixing model in non-constant sum space can be checked by looking at the set of eigenvalues: data that do not fall on the mixing hyperplane will have a value other than 0.0 for the last eigenvalue. Additionally, comparing the raw data on a sample-by-sample basis with its equivalent location in the constant sum hyperplane, via a function similar to the communality, allows the user to identify potentially aberrant data.
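The eigenvalue check mentioned above can be illustrated with synthetic data. The sketch below assumes that the eigenvalues in question behave like those of the covariance matrix of the mean-centered raw data; the end-member values, the noise level and the function name are hypothetical.

```python
import numpy as np

def covariance_eigenvalues(X):
    """Eigenvalues of the covariance of X, largest first."""
    Xc = X - X.mean(axis=0)
    return np.linalg.eigvalsh(Xc.T @ Xc / (len(X) - 1))[::-1]

rng = np.random.default_rng(1)

# Three hypothetical end-members measured on four variables (unequal row sums).
end_members = np.array([[10.0, 2.0, 1.0, 4.0],
                        [1.0, 6.0, 2.0, 9.0],
                        [2.0, 3.0, 20.0, 1.0]])
props = rng.dirichlet(np.ones(3), size=100)
X = props @ end_members

# Exact mixtures: the trailing eigenvalues are zero to rounding error.
clean = covariance_eigenvalues(X)

# Perturbed data no longer fall on the mixing hyperplane.
noisy = covariance_eigenvalues(X + rng.normal(scale=0.5, size=X.shape))

print("exact mixtures:", clean)
print("with noise:    ", noisy)
```

For the exact mixtures only two eigenvalues are non-zero (three end-members span a plane), whereas the perturbed data give a last eigenvalue clearly above zero. In practice the user must decide how far from zero is acceptable for the data at hand.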

As a demonstration, using the previously defined test data set, we can compare the end-members and proportions obtained from a constant sum approach (the data were transformed to 100%) with those from the non-constant sum approach. The sets of end-members are shown in Table 42.1, and the proportions for 10 randomly selected samples of the original 296 are tabulated in Table 42.2. This data set will be made available from the GXStat website (www.GXSTat.com). Note that these data contained the end-members as samples and therefore no iterative schemes such as those described in Full et al. (1981, 1984) were used. It should be noted that, for the most part, the end-members are not that extreme compared to potential test end-members that could have been chosen. Mathematically, this is saying that, with the test data used in this example, most of the variables in the mixing hyperplane lie in portions of that hyperplane which can be modeled as constant sum (i.e. take away the handful of variables that lie in a section of the hyperplane that is most oblique to the constant sum plane, and the data might be able to be modeled using the constant sum algorithm). In the more common case where end-members need to be defined external to the data cloud, the results would potentially have been far off and confusing if the constant sum algorithm had been applied. Also note that if the user did use the constant sum routines to define the composition of the end-members and either manually extracted the raw data of an internal end-member or the 'nearest' actual point (defined by the raw data) to the external end-member, it would be difficult to know how these points relate to all of the other data samples; the user would simply not know whether all the data truly fall on a mixing plane or hyperplane. Finally, because HVA rotates the data to a plane parallel to the constant sum plane, when the data are inherently constant sum, no new program is needed.

**Table 42.1** Test data set end-members (TEST EM) with constant sum end-members (CS EM) and the HVA non-constant sum end-members (HVA EM). Note the disparity between the constant sum end-members (central gray area) and the actual model end-members (white area on right of table). Also note how closely the actual model end-members and the HVA end-members reflect each other. The variables represent parts per million data of 20 different elements

**Table 42.2** Ten randomly selected samples were picked to show the trends of the proportions resulting from the application of the constant sum algorithms and the non-constant sum programs. The a priori proportions for each end-member used to create each test sample are given by the columns ORIG. PROP., the constant sum derived proportions by the grayed columns labeled CS PROP in the center of the table, and the results of the non-constant sum application by the columns on the right of the table labeled HVA PROP. The average error for the proportions of the HVA results was found to be ±0.0004, which was largely attributed to the fact that the raw data were defined using two decimal-point accuracy. The sample numbers represent the sequence number of the row of the test data set

Finally, it should be noted that this non-constant sum model will work for any mixing system that can be modeled as a plane or hyperplane. The dimensionality of the hyperplane must be less than or equal to the number of variables; otherwise there will not be a unique solution to the end-member and proportions problem. This does bring up the case where a three end-member solution (defined by a triangle) in two dimensions can be solved using these algorithms. The G\* rotation described above can potentially produce a plane or hyperplane that intersects the origin, defining an end-member consisting of the origin with (0, 0, …) as its composition. The interpretation of the origin as an end-member has been successful in previous studies when this situation has been encountered. It can be, however, a tricky proposition depending on the type of data being analyzed. It might be useful to substitute a value close to the origin for the definition of that end-member instead of using the origin as an end-member composition.

Areas of application of this approach have included chemo-stratigraphic data, correlation and mapping of wireline well logs, unmixing of oil compositions preserving volume of source material, determination of various forms of risk in exploration schema, correlating biologic assemblages to seismic stratigraphy, and determination of 'sweet spot' locations for oil exploitation, to name a few. Unfortunately, the results of these reports remain confidential. It is anticipated that these and new applications will be reported in the future in various literature.

### **42.4 Summary**

Fifty years of research and development have given the geologic community a useful tool for the analysis of mixtures. It is anticipated at this time that this approach will last well into the future, especially since the program will be made available to anyone, in any field, who wants it. It should be noted, however, that there are still untested areas of research in this field. The most appropriate approach for the definition of extreme end-members is still an open discussion. Generally, researchers have been looking at the extremes of the data and not looking so much at the bulk of the data. While much of the variable density of the raw data may be due to localized over-sampling problems (usually, we geologists sometimes just analyze the data we have!), there are other methods such as FUZZY clustering (Full et al. 1984; Bezdek et al. 1984) and algorithms that use FUZZY variables to define data density in terms of sets of points, lines, planes, hyperplanes and various n-dimensional spaces (Bezdek 1981).

Another area that needs some additional work is the definition of new criteria that will allow the various iterative schemes to know when the 'best' solution is achieved in cases where there might not be complete convergence. In terms of computer programming, it would be beneficial to be able to define one or more 'fixed' end-member(s) (the number being less than the original number of chosen end-members) and let the program determine other potentially viable end-members using the DENEG iteration scheme (i.e. for cases where one or more end-members should be held fixed in the analysis; the programs have always had ways of externally defining all of the end-members). Additionally, defining how the end-members interact with the modeled environment (such as when a geochemical component reaches a given level and precipitates out of the system) would also be of great use. This has been accomplished in the past by making alterations to the program, recompiling the code and proceeding with the newly built custom program. Being able to run this option without having to recompile would be quite useful. Another item on the wish list would be to convert the program out of FORTRAN IV, although the current program is very fast and FORTRAN has become a versatile programming language. This author acknowledges that there are fewer and fewer people who can program in this language, especially in the Windows environment. A language that has a 'better' future would be of great advantage, especially since the programs and algorithms may be used by a wider audience. Additionally, all of the mathematics needs to be described in one place, along with a user manual that describes in detail not only all the options but also the whys and wherefores of particular options. It should be noted that the program has a built-in user manual, but it does not go into the details of the more subtle nuances associated with the algorithms. These missing details will be the topic of various discussions available on the GXStat website (www.GXSTat.com). There is even some progress in producing an R version of the program for those who want to incorporate this approach into their projects. This flexibility will be of benefit to a large community of potential practitioners.

Finally, there is something that can be gleaned from the list of references. The access of researchers to the HVA family of algorithms has been somewhat limited both by changes in the computer industry (computer languages and graphical user interfaces in addition to hardware) and by research association (i.e. who you know). It is for this reason that the complete source code and compiled code for the past algorithms and the HVA code discussed in this report will be made freely available from the GXStat website (www.GXSTat.com) or directly from the author. The test data set and additional research programs such as FUZZY n-Varieties, written by this author, will also be made available (in FORTRAN, of course) through this outlet. This open access will allow others to contribute to the mathematics and algorithms, making them even more useful for the next 50 years.

**Acknowledgements** Many people have contributed to the development of the HVA family of algorithms and programs. It was one of the intents of this report to give them due credit. Credit also goes to those who have devoted a great deal of their careers to the application and dissemination of the unmixing approach. To that end, the grand prize should go to Professor Robert Ehrlich, without whom many would not have known about the diverse applications of this approach. I would like to thank Drs. Magdalena and Nils Blum-Oeste for their comments and improvements on this manuscript and for their overall support, along with pushing the subject material. Dr. Lucinda Brothers-Full also helped with editing this manuscript. Finally, I would like to apologize profusely to anyone who felt I purposely left them off the list of references.

### **References**


Zellers SD, Gary AC (2007) Unmixing foraminiferal assemblages: polytopic vector analysis applied to Yakataga formation sequences in the offshore Gulf of Alaska. Palaios 22:1443–1467

Zhao GT, Wei Z, Full WE, Chen Q, Lin YS (2004) Fourier shape analysis and its application in geology. Periodical of Ocean, University of China, vol 34, pp 429–436


### **Chapter 43 Pearce Element Ratio Diagrams and Cumulate Rocks**

### **J. Nicholls**

**Abstract** While this chapter is about Pearce element ratios, I've included some personal reflections as this book is a 50th Anniversary project of the IAMG. Pearce element ratios, Felix Chayes and the Chayes medal, came together on September 11, 2001. As the recipient of the Chayes Medal, I was in Cancún, Mexico on that fateful date to deliver a talk on Pearce element ratios. Pearce element ratios are designed to model processes of fractionation and accumulation in igneous systems. They are frequently used to extract information from analyses of rocks formed from melts produced by fractionation—volcanic suites. Rock bodies formed from the fractionated crystals—the cumulate rocks—have received practically no attention. From the standard paradigm describing the formation of cumulate rocks, based on studies of the Skaergaard Intrusion, one expects a predicted pattern of data points on a Pearce element ratio diagram. Points derived from the mean compositions of the units in the cumulate body should fall up-slope from the point representing the initial melt composition on a diagram that accounts for the cumulate assemblage. Points derived from the compositions of the inferred residual melts present at the beginning of crystallization of a unit in the rock body should fall down-slope from the point representing the initial magma. The distance between a point on the line of a Pearce element ratio diagram and the point representing the initial magma composition depends on (1) the size of the aliquot that crystallized to form the rock unit and (2) the ratio of crystals to melt in the mush that solidified to form the rock unit. Patterns extracted from computer simulations compared to analogous data points from units of the Skaergaard Intrusion indicate that the crystal mushes that formed the units of the Marginal Border Series had a smaller ratio of trapped melt to crystals than did coeval mushes forming the Upper Border Series. 
Simulation patterns further indicate that the LZa and UZa units of the Layered Series formed from assemblages with larger ratios of melt to crystals than did the respective coeval units, LZa\* and UZa\*, of the Marginal Border Series.

J. Nicholls (✉)

Department of Geoscience, University of Calgary, Calgary, AB T2N1N4, Canada e-mail: jim.nicholls@shaw.ca; nichollj@ucalgary.ca

<sup>©</sup> The Author(s) 2018

B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_43

**Keywords** Pearce element ratios ⋅ Cumulate rocks ⋅ Computer simulation ⋅ Skaergaard intrusion

### **43.1 Introduction**

Blue skies and balmy temperatures graced a tranquil world when I entered the lecture room of the hotel-conference center in Cancún, Mexico, venue of the 2001 International Association for Mathematical Geosciences (IAMG) meeting. It was an early Tuesday morning and I was on my way to ensure the equipment worked for the talk I was soon to deliver. I was looking forward to the day and feeling honoured as the recipient of the IAMG Felix Chayes Prize for Excellence in Research in Mathematical Petrology.

When the talk was over and people were thinking ahead to the coffee break and upcoming lectures, we left the lecture room. Up until that moment, we were unaware that the world had changed: hijackers had crashed murder-suicide planes into the World Trade Center in New York City. Those attending the meeting gathered around a TV and watched the horror of the south tower collapse; smoke and dust billowed down the streets of New York, chasing people as they ran for their lives. The north tower collapsed a few minutes later. Hijackers crashed another plane into the Pentagon, and a fourth had been brought down in a field near Shanksville, Pennsylvania, just minutes away from its target in Washington, D.C. It was September 11, 2001, referred to by nearly all as 9/11.

My talk on Pearce element ratio diagrams and their utility in evaluating petrologic hypotheses was largely forgotten, understandably, in the turmoil following the events of that morning. Pearce element ratios and the events of 9/11 have been inextricably linked in my mind since that terrible morning, which is why they come together in this chapter.

Pearce element ratios were conceived in the last century (Pearce 1968), as were the concepts and techniques needed to implement their application. Their defining characteristic is a denominator formed from concentrations of elements that enter the minerals crystallizing from igneous melts in negligible amounts. Pearce element ratios have been used to model the evolution of melts in volcanic systems (see Nicholls and Russell 2016 for recent applications and explanations of the concepts) but they have not seen much service in modeling changes in the concomitant rocks formed from the separated solids and the enclosed interstitial melts: the cumulate rocks. Pearce element ratios can provide insight into the evolution of such assemblages.

### **43.2 Outline of a Cumulate Rock Paradigm**

Petrologists have developed a paradigm for the crystallization of a single magma body in a crustal magma chamber that explains many of the features of layered and cumulate igneous rocks. This paradigm originated in features found in the Skaergaard Intrusion of East Greenland (Wager and Deer 1939; Carmichael et al. 1974; McBirney 1989a, 1996).

Cumulate bodies are often huge; the Bushveld Complex in South Africa has an estimated volume between 370,000 and 1,000,000 km<sup>3</sup> (Cawthorn and Walraven 1998). Each unit forms by crystallization of a portion or aliquot of the melt in the magma chamber at the time. The larger the unit, the larger the aliquot from which it formed.

A cumulate body can be enclosed by a shell of finer grained rock petrologists are wont to call a chilled margin. The standard inference is that the chilled margin represents the initial magma and that the composition of the chilled margin closely approximates the composition of the initial magma. However, the chilled margin of a large body can be a boundary layer formed by the reaction of the corrosive magma with the country rocks. If so, the composition of the chilled margin can differ from that of the initial magma in a way that depends on the composition of the country rock and on the extent of reaction between magma and country rock. Nevertheless, chilled margins need to be considered as possible samples of the initial magma.

### *43.2.1 The Skaergaard Intrusion*

The Skaergaard Intrusion in East Greenland is one of the most studied rock bodies on the face of the Earth. L.R. Wager discovered the intrusion in 1931 on a scientific expedition. He returned in 1932 on another expedition and again in 1935–36 when he organized and led the third expedition to map and study the intrusion. On this trip, W.A. Deer accompanied him. Publications on the petrology of the Skaergaard began with the report by Wager and Deer (1939). A facsimile of the report was issued in 1952 with a new preface and a list of papers published since the 1939 publication. The list contains 46 references. One can find several hundred references that target the Skaergaard in the literature published after 1952.

I never met Wager but I did meet Deer when he visited the University of California, Berkeley during my time there as a graduate student. On a field trip, he spoke briefly about working with Wager on the Skaergaard. Wager was a mountaineer and climber. In 1933, as a member of the British Expedition to Mount Everest, he climbed to more than 8595 m, setting a record for a climb without oxygen, a record that wasn't bested until 1978. In the preface to the original report, Wager and Deer wrote that the terrain was so demanding that the two-man mapping parties had to traverse roped together, which lends credence to the story in which Deer was reputed to have said he woke scared nearly every day on the Skaergaard because Wager took them up and down cliffs and slopes where Deer would never go himself.

Significant contributions to Skaergaard petrology since the original Wager and Deer report in 1939 have come from Wager (1960), Wager and Brown (1968), Hoover (1989a, b), McBirney (1989a, b, 1996), Ariskin (2002), and Nielsen (2004) among many others. As a result, the Skaergaard Intrusion has become a standard of comparison against which the evolutionary paths of basaltic magmas are measured.

The major units of the intrusion are the Layered Series (LS), composed of relatively horizontal layers, the Marginal Border Series (MBS) composed of relatively steeply dipping layered rocks, and the Upper Border Series (UBS), composed, again, of relatively horizontal layers of rock (Fig. 43.1). The layers in the Layered Series and the Upper Border Series become approximately horizontal after removal of a post-intrusion tilting (McBirney 1989a). The smaller units, Lower Zone a, Lower Zone b (LZa, LZb), etc. (Fig. 43.1) are defined by mineralogical changes. For example, the coeval units of the Middle Zone (MZ, MZ\* and *β*) are characterized by the absence of large, primary crystals of olivine (primocrysts) (McBirney 1989a). Olivine primocrysts occur throughout the rest of the intrusion.

The stratigraphic nomenclature has changed slightly with time. Earlier workers, for example, Wager and Deer (1939), Chayes (1970), Carmichael et al. (1974), and Naslund (1984), called the Marginal Border Series and Upper Border Series (UBS) the Marginal Border Group and Upper Border Group. In addition, an asterisk has been attached to the names of the units of the Marginal Border Series to distinguish them from the units of the Layered Series.

**Fig. 43.1** Rock units of the Skaergaard Intrusion (modified from Nielsen 2004). The Layered Series is interpreted to have formed by sedimentation of the crystallizing minerals onto the floor of the magma chamber. The Marginal Border Series (Hoover 1989a, b) and the Upper Border Series (Naslund 1984) are thought to have formed by plating of the crystallizing minerals on the walls and roof of the magma chamber. Labels in parentheses are number of analyses used to calculate the mean compositions of the rock units (McBirney 1989a)

The original floor of the magma chamber is not exposed, so the first rocks that formed on the floor are not available, nor are samples from UZc\* in the Marginal Border Series because of lack of outcrop (Hoover 1989a). The last or nearly last melt in the chamber is believed to have been caught between the crystal mush of solids and trapped melt that solidified as UZc\* and the bottom of the UBS where the youngest unit of the UBS, the *γ* <sup>3</sup> unit, crystallized.

According to the paradigm, the rocks making up the intrusion formed by sedimentation of the crystallized minerals on the floor of the chamber and by plating of minerals on the roof and walls. The solid assemblages that formed as sediments and as layers of plated minerals change with crystallization stage, as do the mineral compositions. The mineral assemblages and mineral compositions found in the bottom, sides, and top of the solidified magma chamber can be correlated, and a stratigraphy of mineral assemblages and compositions provides coeval markers of crystallization stage. The rocks making up the intrusion consist of the mineral sediments and plated crystals plus melt trapped between the minerals; the trapped liquids later crystallize, creating intercumulus assemblages that, with the primocrysts, make up the rock units that fill the magma chamber.

Properties not emphasized but usually implicit in this paradigm are the ideas that the initial magma filling the magma chamber is uniform in composition and that the compositions of successive melts in the shrinking chamber maintain uniform compositions. These ideas may not be realistic. There may be compositional gradients as well as temperature and pressure gradients in the melt that induce the density currents that develop sedimentary structures, such as cross bedding, in the crystal mush.

In addition, the sedimentation-plating paradigm fails to account for several features of cumulate rocks, for example, repetition of stratigraphic units in the sedimentary layers (Bons et al. 2015). Mush formation above the magma interface (Bons et al. 2015) and double-diffusive convection in boundary layers (Huppert and Turner 1981; McBirney and Noyes 1979; McBirney 1985) are processes postulated to account for the repetition.

Processes behind the magma-mush front (post-cumulus processes, Sparks et al. 1985) can also affect the mineralogy and chemistry of the phases involved in the evolution of the magma body. These processes include convection in the trapped melt, compaction, and cementation. Cementation could produce significant chemical changes in the cumulate rock. Large, optically continuous crystals (poikilitic crystals) can be found enclosing previously formed primocrysts in both lava flows and in cumulate rocks. In the Skaergaard Intrusion and larger cumulate bodies, an interconnected crystal of pyroxene or plagioclase often fills the interstices between the primocrysts.

One infers the primocrysts were originally enclosed in a melt with the same composition as the melt that filled the magma chamber at the time and that melt was trapped between the primocrysts on the boundaries of the magma chamber. On crystallization of the interstitial melt, a single crystal can grow, fill the interconnected spaces, and displace the trapped melt. The melt, after undergoing post-cumulus processes, will differ in composition from the initially trapped melt. This modified melt could be expelled from the crystal pile and mix with the magma in the chamber, changing its composition. The large poikilitic crystals left behind would be part of the cement that holds the rock together.

Granted that processes in front of and behind the crystallization boundary can affect the resulting cumulate rock, the questions are: how effective are they in changing the rock composition and do they have a detectable influence on the composition of the melt in the chamber?

When the trapped melt crystallizes, permeability decreases, flow of melt from the cumulate mush slows, and its potential to change the composition of the melt in the magma chamber is lowered.

The difference in composition between the trapped melt and the melt in the magma chamber affects the composition of a mix of the two. If the trapped melt differs only slightly from the composition of the melt in the magma chamber, then the composition of a mix will differ from that of the composition of the melt in the magma chamber by a small amount, especially if the amount of trapped melt added to the mix is small.
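The argument in the preceding paragraph can be written as a simple mass balance. The symbols are introduced here for illustration and are not the chapter's notation: $C$ is the concentration of any constituent and $f$ is the mass fraction of trapped melt in the mix.

$$C_{\text{mix}} = f\,C_{\text{trapped}} + (1 - f)\,C_{\text{chamber}} \quad\Longrightarrow\quad C_{\text{mix}} - C_{\text{chamber}} = f\,\left(C_{\text{trapped}} - C_{\text{chamber}}\right)$$

The change in the chamber melt is the product of two factors, so it is small whenever either the fraction of expelled melt or the compositional difference is small.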

Melt trapped in the crystal mush close to the crystallization boundary will be close in composition to the melt in the magma chamber. Farther from the boundary, the compositional differences will be larger. However, post-cumulus processes such as compaction, adcumulus growth (crystal growth on the surfaces of the primocrysts exposed to the interstitial melt), and cementation will act to decrease the volume of the trapped melt farther from the boundary.

Expulsion of the trapped melt from the crystal mush could change the chemical composition of the melt in the magma chamber; however, the physical setting and processes could work in concert to keep the changes small.

Magma mixing, magma recharge, and magma mingling are labels for similar if not nearly identical processes. Simply put, the terms label the incorporation of one magma into another. If the invasive magma has a different composition than the original, the final body will have a different composition from the original (Anderson 1976; Carmichael 2004). Again, the effect of mixing on the chemistry of the combined magmas depends on how different the compositions are. The greater the differences, the greater the effect.

### **43.3 Pearce Element Ratio Patterns for Cumulate Rocks**

The data to test any model of cumulate rock formation, Pearce element ratio or otherwise, come from geologic maps, mineralogy, rock and mineral compositions, and rock textures. The more features of the data a model can predict, the stronger the model. If the model conforms to the data, the model is accepted as a description of the implied process that formed the rocks. If the model does not conform to the data, the model is rejected as an explanation (Nicholls and Russell 2016).

The numerators of the ratios plotted on the rectilinear axes of a Pearce element ratio diagram reflect the chemical changes in the melt-solid system caused by segregation and accumulation (sorting) of a specified mineral assemblage. Specification of the mineral assemblage allows us to create a model such that the compositions of melts and solid assemblages will fall on a line with the model slope. Only one rock composition is needed to locate the model line in Pearce element ratio space. The other analyses in the set of rock analyses can then be used to test the model. The specifics of the model dictate the slope of the line. Usually, the slope is one by design. Consequently, we can talk about up-slope and down-slope directions from a fixed point on the model line. If we select the point representing the chemistry of the melt present when the rock unit begins to form as the fixed point, then a point representing the chemistry of the derivative or residual melt will fall down-slope from the fixed point. Points representing the chemistry of crystal-melt mixtures (crystal mushes) will fall up-slope from the fixed point.

The general pattern expected for data points representing melts from a system undergoing sorting is known (Pearce 1968; Russell and Nicholls 1988; Nicholls and Russell 2016). The details of patterns expected in the data collected from cumulate bodies have not been explicitly investigated. A simple computer simulation of accumulation processes can delineate at least some of the expected patterns. Details of the simulation are described in the appendix.

The results of a simulation for a system with the composition listed in Table 43.1 are shown on Fig. 43.2. The Pearce element ratios plotted on Fig. 43.2 are:

$$(0.8\,\text{Al} + 0.5\,\text{Mg} + 0.4\,\text{Ca})/\text{K}\,\text{versus}\,\text{Si}/\text{K}$$

The diagram was designed to describe the Pearce element ratios in the melts generated by fractionation (loss) of anorthite (CaAl<sub>2</sub>Si<sub>2</sub>O<sub>8</sub>) and forsterite (Mg<sub>2</sub>SiO<sub>4</sub>) from the initial melt. The Pearce element ratio coordinates of the initial melt are shown with a black star on Fig. 43.2. The ratios derived from the compositions of the solids plus trapped melt are shown by filled circles.



**Fig. 43.2** Pearce element ratio diagram for crystallization of a simulated system containing Si, Al, Mg, Ca, K, and P. Forsterite and anorthite are subtracted from the initial melt, leaving a residual melt that is trapped in the solid assemblage. Rocks formed by the simulated process would be composed of forsterite, anorthite, and solidified trapped melt (see appendix)

As expected, all the data points generated by the simulation fall on a line with a slope of one. The residual melts produce points on the line that fall down-slope from the point representing the initial melt. Points representing the compositions of the accumulated solids and trapped melt plot up-slope from the point representing the initial melt (Fig. 43.2). These relationships are simply examples of the lever rule of phase diagrams (see Bloss 1994, pp. 304–306).
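The slope-one behaviour and the up-slope/down-slope pattern can be reproduced with a few lines of mass balance. The sketch below is a minimal illustration, not the simulation described in the appendix: the initial melt composition, the amounts of forsterite and anorthite removed, and the fraction of residual melt trapped in the rock are all hypothetical values chosen for the example, and the function names are invented here.

```python
# Hypothetical initial melt in moles of each element (illustrative values only,
# not the composition listed in Table 43.1).
INITIAL_MELT = {"Si": 50.0, "Al": 16.0, "Mg": 10.0, "Ca": 9.0, "K": 1.0, "P": 0.2}

# Moles of each element per mole of mineral.
FORSTERITE = {"Mg": 2.0, "Si": 1.0}             # Mg2SiO4
ANORTHITE = {"Ca": 1.0, "Al": 2.0, "Si": 2.0}   # CaAl2Si2O8


def pearce_point(comp):
    """Return (x, y) = (Si/K, (0.8 Al + 0.5 Mg + 0.4 Ca)/K)."""
    k = comp["K"]
    x = comp["Si"] / k
    y = (0.8 * comp["Al"] + 0.5 * comp["Mg"] + 0.4 * comp["Ca"]) / k
    return x, y


def crystallize(melt, n_fo, n_an, trapped_fraction):
    """Remove n_fo mol forsterite and n_an mol anorthite from the melt.

    Returns (residual_melt, rock). The rock is the crystals plus the stated
    fraction of the residual melt trapped between them (an assumption of
    this sketch).
    """
    solids = {el: 0.0 for el in melt}
    for mineral, moles in ((FORSTERITE, n_fo), (ANORTHITE, n_an)):
        for el, coeff in mineral.items():
            solids[el] += coeff * moles
    residual = {el: melt[el] - solids[el] for el in melt}
    rock = {el: solids[el] + trapped_fraction * residual[el] for el in melt}
    return residual, rock


residual, rock = crystallize(INITIAL_MELT, n_fo=2.0, n_an=3.0, trapped_fraction=0.3)

x_init, y_init = pearce_point(INITIAL_MELT)
x_res, y_res = pearce_point(residual)
x_rock, y_rock = pearce_point(rock)

# Slopes of the segments joining each derived point to the initial melt.
slope_melt = (y_res - y_init) / (x_res - x_init)
slope_rock = (y_rock - y_init) / (x_rock - x_init)

print(f"initial melt : x = {x_init:.2f}, y = {y_init:.2f}")
print(f"residual melt: x = {x_res:.2f}, y = {y_res:.2f}, slope = {slope_melt:.3f}")
print(f"rock (mush)  : x = {x_rock:.2f}, y = {y_rock:.2f}, slope = {slope_rock:.3f}")
```

Because K does not enter either mineral, the residual melt keeps all of the K while losing Si, so its point moves down-slope. The rock holds only the K of its trapped melt, so its point moves up-slope.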

A second model is shown with a dashed line on Fig. 43.2. If the magma chamber undergoes recharge by a similar but not identical magma, we would expect the same ratio pair to describe the variation produced by crystallization of the second melt. The composition of the second simulated melt that produced the data points shown by squares is listed in Table 43.1. Mixing and crystallization of the mixed melts would produce data points falling between the two model-lines.

If the coordinates of the fixed point on a Pearce element ratio diagram are (*xi*, *yi*), then the distance between the fixed point and another point on the model line with coordinates equal to (*xj*, *yj*) will be given by:

$$d = \left(x_j - x_i\right)\sqrt{2}$$

if the slope is equal to one and if the points representing cumulate assemblages fall exactly on the model line.

**Fig. 43.3** Plot of distance along the model line with a slope of one (red line in Fig. 43.2) from the point representing the initial melt (star in Fig. 43.2) against aliquot size. The arrows indicate direction of increasing ratio of trapped melt to cumulate solids
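The factor of $\sqrt{2}$ in the distance expression follows from the Pythagorean theorem. On a line with a slope of one, the difference in the *y* coordinates of two points equals the difference in their *x* coordinates, so that, as a short worked step:

$$d = \sqrt{\left(x_j - x_i\right)^2 + \left(y_j - y_i\right)^2} = \sqrt{2\left(x_j - x_i\right)^2} = \left|x_j - x_i\right|\sqrt{2}$$

Dropping the absolute value gives a signed distance that is positive for points up-slope from the fixed point and negative for points down-slope.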

Two quantities determine the distance of a point from the fixed point: the size of the quantity of melt (aliquot) that crystallized to form the unit of crystals plus trapped melt and the amount of melt trapped in the crystal mush. Figure 43.3 shows how distance along the model line, aliquot size, and ratio of trapped melt to solid are related in the simulated system.

The two variables, distance along the model line and aliquot size, work in concert. The two are also quantities that can be extracted from sets of rock analyses and from geologic maps. The relationship between the two can be described by treating the ratio of the amount of trapped melt to the amount of accumulated crystals in a single unit of the cumulate rock body as a parameter. On a plot of aliquot size versus distance from the point representing the melt along the model line, lines of constant ratio of trapped melt to solid in the mush fan across the diagram. The smaller the ratio, the farther the line of constant ratio falls from the x-axis (Fig. 43.3).

Approximations of the amount of trapped melt could be made from estimates of petrographic modes (Chayes 1956; Nicholls and Stout 1986) of intercumulus assemblages versus primocrysts in thin section. However, distinguishing adcumulus growth from original growth material of the primocrysts is sometimes difficult. In addition, modal variations must underlie the large chemical variations found in the units of cumulate rocks (see below, Sect. 43.4). Consequently, petrographic assessment of the ratio of the volumes of trapped melt to primocrysts would require looking at many samples to get a precise value for a unit in the intrusion. At the present time, data to make a quantitative assessment of the agreement between model values and precise estimates of the petrographic modes are not available.

### **43.4 Compositions of Units of the Skaergaard Intrusion**

A challenge to the construction of viable Pearce element ratio models of cumulate rock formation arises from the chemical and mineralogical heterogeneity in the map units. The compositions of the constituent units must be determined as the mean of analyses from different locations in the unit. Mean values of the compositions and standard deviations for each constituent were published by McBirney (1989a, 1996) with data from Naslund (1984) for the Upper Border Series units. Dividing the standard deviations by the square root of the number of samples gives the standard errors of the means, the accepted measure of the uncertainty in a mean value. Standard errors of the means are large compared to analytical uncertainty (compare McBirney 1989a; Wright et al. 1975, p. 117). Analytical uncertainties are often two orders of magnitude smaller than the standard errors of the means. To make the two measures of uncertainty approximately equal, on the order of 10,000 samples would have to be analyzed for each unit.
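The square-root dependence is what makes the required number of samples so large. The following sketch uses hypothetical numbers (a standard deviation of 2.0 wt% among 25 samples and an analytical uncertainty of 0.02 wt%, neither taken from the Skaergaard data) to show how the standard error of a mean is obtained and how many samples would be needed to bring it down to a target value.

```python
import math


def standard_error(std_dev, n_samples):
    """Standard error of the mean: standard deviation / sqrt(number of samples)."""
    return std_dev / math.sqrt(n_samples)


def samples_needed(std_dev, target_se):
    """Smallest number of samples for which std_dev / sqrt(n) <= target_se."""
    # Rounding guards against floating-point noise just above a whole number.
    return math.ceil(round((std_dev / target_se) ** 2, 9))


# Hypothetical values for one oxide in one unit (not measured data).
std_dev = 2.0                 # wt%, scatter among samples from the unit
n_samples = 25                # number of analyses averaged
analytical_uncertainty = 0.02 # wt%, two orders of magnitude below std_dev

se = standard_error(std_dev, n_samples)
needed = samples_needed(std_dev, analytical_uncertainty)

print(f"standard error of the mean with {n_samples} samples: {se:.2f} wt%")
print(f"samples needed to reach {analytical_uncertainty} wt%: {needed}")
```

In this illustration, a scatter one hundred times larger than the analytical uncertainty calls for one hundred squared, or 10,000, samples.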

When evaluating a model by comparing values from the model with the data, we expect certain criteria to be met if the model is successful. When testing models treating volcanic rocks, we expect model values to agree with the analytical data to within analytical uncertainty (Nicholls and Russell 2016; Nicholls and Stout 1988). Implicit in this expectation is the assumption that a sample from a lava flow is representative of the flow itself.

Estimates of the proportional volumes (Nielsen 2004) are shown on Fig. 43.1. The proportions, expressed as percentages of the volume of the intrusion were derived from the geologic maps of the body. It is worth explicitly noting that the quantitative entity plotted on Fig. 43.1 is volume, not thickness as has been traditionally plotted on similar looking graphs. Distances along the parallel lines have no real-world significance. The proportional volumes shown on Fig. 43.1 are not all independent (Nielsen 2004, p. 519). This dependence is revealed on Fig. 43.1 by the straight lines separating Layered Series volumes from the Marginal Border Series volumes and the Marginal Border Series volumes from the Upper Border Series volumes.

The abundant primocrysts in the intrusion are plagioclase, olivine, pyroxene (high-Ca augite and low-Ca pigeonite since inverted to orthopyroxene), and Fe-Ti oxides. The Middle Zone of the Layered Series, the Middle Zone of the Marginal Border Series, and the Upper Border Series *β*-zone lack olivine primocrysts, their place taken by low-Ca pyroxene.

### *43.4.1 Pearce Element Ratios and the Skaergaard Intrusion*

We would like a Pearce element ratio design such that the products of sorting of all the mineral-melt assemblages in the intrusion would have compositions that generate points along a straight line with a slope of one. Unfortunately, nature prevents construction of such a diagram. The stoichiometry of olivine, (Mg, Fe)<sub>2</sub>SiO<sub>4</sub>, and low-Ca pyroxene, (Mg, Fe)<sub>2</sub>Si<sub>2</sub>O<sub>6</sub>, with their different ratios of (Mg, Fe) to Si, leads to an inconsistent set of algebraic equations in the design matrix (Nicholls and Russell 2016; Nicholls and Gordon 1994). We can, however, design two diagrams, one that accounts for sorting of olivine, plagioclase, augite, and Fe-Ti oxide and another that accounts for sorting of low-Ca pyroxene, plagioclase, augite, and Fe-Ti oxide.

Two ratio pairs that account for the abundant phases and their different compositions are:

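$$[0.25\,\text{Al} + 0.5\,(\text{Fe} + \text{Mg}) + 1.5\,\text{Ca} + 2.75\,\text{Na}]/\text{K}\,\text{versus}\,(\text{Si} + 1.5\,\text{Ti})/\text{K}$$

(olivine in the sorted assemblage)

and

$$[0.5\,\text{Al} + \text{Fe} + \text{Mg} + \text{Ca} + 2.5\,\text{Na}]/\text{K}\,\text{versus}\,(\text{Si} + 3\,\text{Ti})/\text{K}$$

(low-Ca pyroxene in the sorted assemblage)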

Pearce element ratio diagrams for the two ratio pairs appear on Figs. 43.4 and 43.5. Figure 43.4 shows the diagram for olivine in the sorted assemblage whereas Fig. 43.5 shows a diagram for low-Ca pyroxene in the sorted assemblage.
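The two ratio pairs can be checked against idealized mineral formulas. The sketch below computes, for each mineral, the slope of the trend produced by adding or removing that mineral at constant K. The formulas are assumptions made for the check: plagioclase is represented by its anorthite and albite end-members, augite by Ca(Mg,Fe)Si<sub>2</sub>O<sub>6</sub> with Mg standing in for Mg + Fe, and the Fe-Ti oxide by a mix of 75% ulvöspinel (Fe<sub>2</sub>TiO<sub>4</sub>) and 25% magnetite (Fe<sub>3</sub>O<sub>4</sub>), which is one reading of the label Usp75.

```python
# Idealized formulas: moles of each element per mole of mineral (assumptions).
MINERALS = {
    "anorthite": {"Ca": 1.0, "Al": 2.0, "Si": 2.0},
    "albite": {"Na": 1.0, "Al": 1.0, "Si": 3.0},
    "augite": {"Ca": 1.0, "Mg": 1.0, "Si": 2.0},
    "olivine": {"Mg": 2.0, "Si": 1.0},
    "low-Ca pyroxene": {"Mg": 2.0, "Si": 2.0},
    "Fe-Ti oxide (Usp75)": {"Fe": 2.25, "Ti": 0.75},
}

# Each design: (numerator coefficients of the y ratio, of the x ratio); K is the
# denominator of both ratios.
OLIVINE_DESIGN = (
    {"Al": 0.25, "Fe": 0.5, "Mg": 0.5, "Ca": 1.5, "Na": 2.75},
    {"Si": 1.0, "Ti": 1.5},
)
PYROXENE_DESIGN = (
    {"Al": 0.5, "Fe": 1.0, "Mg": 1.0, "Ca": 1.0, "Na": 2.5},
    {"Si": 1.0, "Ti": 3.0},
)


def sorting_slope(mineral, design):
    """Slope (change in y over change in x) caused by sorting one mineral."""
    y_coeffs, x_coeffs = design
    dy = sum(y_coeffs.get(el, 0.0) * n for el, n in mineral.items())
    dx = sum(x_coeffs.get(el, 0.0) * n for el, n in mineral.items())
    return dy / dx


olivine_slopes = {name: sorting_slope(m, OLIVINE_DESIGN) for name, m in MINERALS.items()}
pyroxene_slopes = {name: sorting_slope(m, PYROXENE_DESIGN) for name, m in MINERALS.items()}

for name in MINERALS:
    print(f"{name:22s} olivine design: {olivine_slopes[name]:.2f}   "
          f"pyroxene design: {pyroxene_slopes[name]:.2f}")
```

Under these assumptions, every mineral gives a slope of one on the diagram designed for it. Low-Ca pyroxene gives a slope of 0.5 on the olivine diagram and olivine gives a slope of 2 on the low-Ca pyroxene diagram, which is why no single diagram accommodates both phases.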

**Fig. 43.4** Pearce element ratio diagram designed to show the effects of sorting plagioclase, augite, olivine, and Fe-Ti oxide (Usp75). Accumulation of Ca-poor pyroxene in addition to the listed minerals would cause data points to fall away from the model line along trends parallel to the arrow. The grey ellipse represents the size of the 1σ uncertainty in the location of the data point for UZb\*

The points on the diagrams were calculated from the mean values of the compositions (McBirney 1989a, 1996). On both diagrams, the points are distributed along a trend with a slope of one but with considerable scatter; more scatter than found in trends calculated for suites of cogenetic volcanic rocks (compare Figs. 43.4 and 43.5 with diagrams in Nicholls and Russell 1991, 2016). The Skaergaard data span a larger range of values than do data from volcanic suites when plotted on similar Pearce element ratio diagrams. Data collected from basaltic volcanic suites, when plotted on comparable diagrams, span approximately 50 units (see Nicholls and Russell 1991). The Skaergaard data span approximately 250 units.

The number of analyses for several of the units in the Skaergaard Intrusion is large enough to make the mean values relatively stable, in the sense that one more analysis would have a small effect on the mean, especially if that analysis were for a rock like the ones already analyzed. However, the large standard errors attached to the mean values open the possibility that analyses of another set of samples of the same size collected from the same unit could result in a different set of means for the constituent oxide values.

Propagating the standard errors of the means through the procedure for calculating the uncertainty in the location of a data point (Nicholls 1990b) produces large ellipses of 1σ uncertainty in the location of the data point. The smallest ellipses for the data points shown on Figs. 43.4 and 43.5 belong to the points representing the mean of the UZb\* unit of the Marginal Border Series.

The sizes of the uncertainty ellipses render them useless for testing the model. Almost any line with a slope of one will intercept the uncertainty ellipses. The model cannot be rejected because of the scatter of the data points off almost any line with a slope of one that we can pick.

Although the data points on Figs. 43.4 and 43.5 fall along a trend with a slope of one, the scatter about the trend precludes there being an obvious choice for a point through which to draw a model line. We could draw lines with unit slopes through every one of the data points but could not justify picking any one line over the others.

We can, however, calculate the mean compositions of each series (LS, MBS, UBS) by weighting the mean compositions of the units in the series by their respective relative volumes. The points derived from the weighted means are plotted as diamonds on Figs. 43.4 and 43.5. The points representing the weighted means do fall on a trend with a unit slope and with less scatter than does the full set of data points. It is a straightforward procedure to find a line with a slope of one that falls closest, in the least-squares sense, to the three points representing the weighted mean compositions of the three series that make up the intrusion. The best fit lines for the weighted means fall close to the respective points (Figs. 43.4 and 43.5), well within any 1σ error ellipse. These lines we will use as our model lines.
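Both steps are short calculations. The sketch below shows one way to carry them out; the function names, compositions, volumes, and plotted points are hypothetical and serve only to illustrate the arithmetic. For a line of fixed unit slope, y = x + b, the least-squares intercept is the mean of (y − x), whether residuals are measured vertically or perpendicular to the line, because the two differ only by a constant factor of √2.

```python
def volume_weighted_mean(compositions, volumes):
    """Mean composition of a series from unit means weighted by unit volumes."""
    total = sum(volumes)
    return {
        oxide: sum(comp[oxide] * vol for comp, vol in zip(compositions, volumes)) / total
        for oxide in compositions[0]
    }


def unit_slope_intercept(points):
    """Least-squares intercept b of the line y = x + b through (x, y) points."""
    return sum(y - x for x, y in points) / len(points)


# Hypothetical unit means (wt%) and relative volumes for a two-unit series.
unit_means = [{"SiO2": 48.0, "K2O": 0.2}, {"SiO2": 52.0, "K2O": 0.4}]
unit_volumes = [3.0, 1.0]
series_mean = volume_weighted_mean(unit_means, unit_volumes)

# Hypothetical Pearce element ratio coordinates for three series means.
series_points = [(100.0, 40.0), (150.0, 92.0), (200.0, 138.0)]
intercept = unit_slope_intercept(series_points)
residuals = [y - (x + intercept) for x, y in series_points]

print("series mean:", series_mean)
print(f"model line: y = x + ({intercept:.1f})")
print("residuals:", residuals)
```

The residuals sum to zero by construction, so their size relative to the uncertainty ellipses, rather than their sign, is what matters when judging the fit.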

The inclusion of olivine or low-Ca pyroxene in the model assemblages produces no statistically significant difference in the efficacy of testing the models that I can see. If the lines defined by the weighted mean compositions for the three Series (LS, MBS, UBS) are the best models, then one would expect the points representing the Middle Zone rocks (MZ, MZ\*, *β*) on Fig. 43.4 to deviate by falling below the line. They don't fall farther from the line than do points for the other units. Rather, they often fall closer to the line. Possibly, low-Ca pyroxene accumulated in the Middle Zone units in insufficient amounts to be detected with the olivine-sorting model.

On Fig. 43.5, one would expect the points representing the units outside the Middle Zone units to fall above a model line through points representing the Middle Zone rocks. The dashed line on Fig. 43.5 is a best fit line with a slope of one and is defined by the three Middle Zone values (MZ, MZ\*, *β*). The data points for the other units displayed on Fig. 43.5 are displaced as expected if olivine sorting happened; they fall above the line.

The points representing the units (filled circles) fall in overlapping clusters along a trend with a slope of one with the larger units of the Layered Series generally falling up-slope from the points representing the Marginal Border Series units and with the Upper Border Series points falling farthest down-slope. This distribution is consistent with predictions from the computer simulations. The points representing Series compositions (filled diamonds) are also distributed as predicted by the computer simulation; the larger aliquot plots up-slope and the smaller aliquot down-slope.

The trends followed by the data points on Figs. 43.4 and 43.5 are consistent with the predictions of the models. Given the size of the uncertainties in the locations of the data points, there is no evidence that more than one magma was involved in the formation of the Skaergaard Intrusion.

### **43.5 Melts of the Skaergaard Intrusion**

Three categories of melt crystallized to form the Skaergaard Intrusion: the melt that initially filled the magma chamber, the subsequent melts residual to each crystallization stage, and the melts trapped between the primocrysts. Melts trapped in the oldest part of a unit would have a different composition from melts trapped in the youngest part of a unit. Melt trapped in the youngest part of the unit would have the composition of the residual melt at the time of entrapment. Between crystallization of the oldest and youngest crystals in the units, the trapped melt would have compositions gradational between the two.

Any melts that existed in the Skaergaard crystallized long ago. Perforce, their compositions and their nature must be inferred. The melts whose compositions we can infer are the initial melt and the residual melts filling the magma chamber at the end of the formation of each rock unit and the beginning of the next.

### *43.5.1 The Initial Melt*

Pearce element ratios for estimated compositions of the initial melt are plotted on Fig. 43.6. The initial melt composition should plot down-slope from the point representing the mean composition of the Layered Series. Estimates of the initial Skaergaard magma have been made by Wager (1960), Hoover (1989a), McBirney (1996), Ariskin (1999), and Nielsen (2004). Wager (1960) used a composition from a sample from the chilled margin of the intrusion. Hoover (1989a) also used an analysis from a sample of the chilled margin but complemented it with melting experiments. Ariskin (1999) used thermodynamic modeling to make his estimates. AA1 (Fig. 43.6) is his preferred value. Nielsen (2004) based his estimate on volumes and average compositions complemented by comparison with chilled margin compositions and compositions of Tertiary basalts found near the intrusion. McBirney (1996) based his estimate on the mean composition of three samples from the chilled margin.

The estimates made by Wager (1960) and Ariskin (1999) do not fit the pattern we expect. A point representing an initial melt on a Pearce element ratio diagram should plot down-slope from the point representing our best estimate of the bulk composition of the intrusion (grey diamond, Fig. 43.6). I think it a tribute to the acumen of the estimators that all the preferred values fall close to the model line defined by the points representing the compositions of the weighted means of the major units of the intrusion.

### *43.5.2 Residual Melts*

**Fig. 43.6** Pearce element ratio diagram showing the points derived from the mean compositions for Skaergaard rocks (McBirney 1996) and estimates of the composition of the original Skaergaard magma (Wager 1960; Hoover 1989a, b; McBirney 1996; Nielsen 2004). The ratios plotted on the axes of the diagrams are designed such that melt and rock compositions should fall on a line with a slope of one if potassium (K) was conserved in the melts during crystallization of olivine, calcic pyroxene, plagioclase and an Fe-Ti oxide (Usp75)

In addition to values for the mean compositions of the rock units of the Skaergaard Intrusion and estimates of the compositions of the initial melts, there are at least two estimates of the compositions of the melt that filled the magma chamber at the time the particular crystal mush was in place: (1) experimentally determined compositions (McBirney 1996, red circles on Fig. 43.7) and (2) compositions derived through thermodynamic modeling (Ariskin 2002, green triangles on Fig. 43.7).

Felix Chayes was a petrologist who used mathematics in innovative ways to understand petrologic processes at a time when most petrologists knew little about mathematics. Among his many contributions was a small text that enhanced our understanding of the roles ratios can play in inferring petrologic processes (Chayes 1971). I met him but once, at the 1967 meeting of the Geological Society of America in New Orleans. I was one of a number of grad students and academics gathered in a night club. I later corresponded with him in the late 1980s about the efficacy of the correlation coefficient as a statistic for testing Pearce element ratio models. That correspondence caused me to use the designed slope of the line on a Pearce element ratio diagram as a characteristic of the model rather than a line fit to the data by least-squares methods. The designed line can then be compared to the data. Hence, one doesn't need the correlation coefficient to evaluate Pearce element ratio models. I think the same realization came independently to several others, notably Kelly Russell and Cliff Stanley, at about the same time.

**Fig. 43.7** Pearce element ratio diagram showing the locations of points representing residual melt compositions at the end of the crystallization of the coeval units of the Skaergaard Intrusion: residual melt compositions estimated by McBirney (1996) and Ariskin (2003), and points calculated with Chayes' (1970) algorithm

In 1970 Chayes published a scheme for calculating residual melt compositions in the magma chamber and trapped in the mush during crystallization. His equation is:

$$\mathbf{M}_{k+1} = \left[\mathbf{M}_1 - \sum_{j=1}^{k} P_j \mathbf{X}_j\right] \Big/ \left(1 - \sum_{j=1}^{k} P_j\right), \quad 0 < k < n$$

where the **M**<sub>*i*</sub> are vectors whose elements are a set of oxide values in the residual melt and *n* is the number of units in the intrusion. **M**<sub>1</sub> is the vector containing the oxide values for the initial melt. The *P<sub>j</sub>* are the volumes or proportional volumes of the units in the intrusion. The **X**<sub>*j*</sub> are vectors of the mean values of the oxides in the units of the intrusion.

The values contained in the **M**<sub>*i*</sub>, *i* > 1, depend on the values contained in **M**<sub>1</sub>. Change the values in **M**<sub>1</sub> and the values in the **M**<sub>*i*</sub> change.

All values for the initial melt, **M**<sub>1</sub>, except those estimated by McBirney (1996), generate negative values for some of the oxides in the **M**<sub>*k*</sub> at later stages in the evolution of the residual melts (*k* > 3). The Pearce element ratios for residual melts generated with Chayes' (1970) equation using McBirney's (1996) estimate for the values in the initial melt are shown with solid black circles on Fig. 43.7.
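To make the mass balance concrete, the equation can be sketched in Python as below. This is an illustration under stated assumptions, not Chayes' original program: the function name, the two-oxide system and all numbers are hypothetical, and the *P<sub>j</sub>* are taken to be fractions of the whole intrusion that sum to one.

```python
def residual_melts(initial_melt, proportions, unit_means):
    """Return [M_2, ..., M_n] from the mass-balance equation.

    initial_melt: oxide values of the initial melt, M_1
    proportions:  P_j, fraction of the intrusion occupied by unit j
    unit_means:   X_j, mean oxide values of unit j (one list per unit)
    """
    melts = []
    removed = [0.0] * len(initial_melt)  # running sum of P_j * X_j
    fraction = 0.0                       # running sum of P_j
    # k runs from 1 to n - 1; at k = n no melt is left (denominator is 0).
    for p, x in zip(proportions[:-1], unit_means[:-1]):
        removed = [r + p * xi for r, xi in zip(removed, x)]
        fraction += p
        melts.append([(m - r) / (1.0 - fraction)
                      for m, r in zip(initial_melt, removed)])
    return melts


# Hypothetical two-oxide system with three units (not Skaergaard data).
P = [0.5, 0.3, 0.2]
X = [[40.0, 16.0], [60.0, 5.0], [60.0, 2.5]]

# An initial melt that balances the units exactly gives physical melts.
consistent = residual_melts([50.0, 10.0], P, X)

# Lowering the second oxide in M_1 drives a residual melt negative.
inconsistent = residual_melts([50.0, 7.0], P, X)
has_negative = any(v < 0 for melt in inconsistent for v in melt)
```

In the consistent case the last residual melt equals the mean composition of the last unit, as mass balance requires. In the second case the first unit alone removes more of the second oxide than the assumed initial melt contains, which is the kind of negative value described above.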

At any stage in the evolution of the Skaergaard Intrusion, the residual melt is simultaneously depositing crystals on the floor, walls and roof of the magma chamber, at least according to the simplest paradigm. The points to be compared to the simulated patterns, then, are the weighted means of the coeval units. Pearce element ratios for the three sets of residual melts (Chayes 1970 algorithm; McBirney 1996; Ariskin 2003) can be compared on Fig. 43.7. McBirney's (1996) estimates for the compositions of the residual melts at the end of LZa, LZc, MZ, UZa, and UZb do not fit the expected pattern in that they plot up-slope from their respective cumulate compositions. All of the points representing the residual melt compositions estimated by Ariskin (2003) plot down-slope from the points representing their respective cumulate compositions, as do residual melt compositions calculated with Chayes' (1970) algorithm. Only the latter, however, fall in sequential order, a pattern expected for a series of melts formed by fractionation of a single initial magma.

### *43.5.3 Relative Amounts of Trapped Melt*

We can make qualitative assessments of the amount of melt trapped in the cumulates by plotting the relative volumes of the units in the intrusion against the position of the Pearce element values along the model line (see Fig. 43.2).

Data points on Pearce element ratio diagrams need not fall exactly on model lines, which makes calculating distance along the model line less straightforward than given in the formula above (see Sect. 43.3). To calculate distance from a point representing a melt composition to a point representing a cumulate composition, we measure the distance along the model line between the two points on the line that are closest to each of the two points in question. The point of closest approach will be along a line through the point and normal to the model line. An example is shown on Fig. 43.8 for the coeval Lower Zone units (LZa, LZa\*, and *α*<sup>1</sup>). The points on Fig. 43.8 represent the initial melt composition (McBirney 1996, black star) and the mean compositions of the units (McBirney 1996, coloured circles).
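The construction just described can be sketched as follows. This is a minimal illustration, assuming a model line of the form *y* = *x* + *b* and Euclidean distance in the plotted ratio coordinates; the function names and coordinates are hypothetical.

```python
import math


def foot_on_unit_slope_line(point, intercept):
    """Closest point on the line y = x + intercept to the given point.

    The foot of the perpendicular from (x, y) onto a line of slope one
    has x-coordinate (x + y - intercept) / 2.
    """
    x, y = point
    fx = (x + y - intercept) / 2.0
    return fx, fx + intercept


def distance_along_line(p, q, intercept):
    """Distance along the model line between the feet of p and q."""
    fp = foot_on_unit_slope_line(p, intercept)
    fq = foot_on_unit_slope_line(q, intercept)
    return math.hypot(fp[0] - fq[0], fp[1] - fq[1])


# Hypothetical melt and cumulate points and a model line y = x + 10.
melt = (10.0, 22.0)
cumulate = (30.0, 38.0)
d = distance_along_line(melt, cumulate, 10.0)
```

For a line with unit slope the result reduces to |(*x*<sub>1</sub> + *y*<sub>1</sub>) − (*x*<sub>2</sub> + *y*<sub>2</sub>)|/√2, so the distance does not depend on the intercept of the model line.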

[Figure: horizontal axis (Si + 1.5 Ti)/K, tick marks from 0 to 250]

Figure 43.9 shows the distances along the model line versus unit size expressed as a percentage of the volume of the intrusion (Nielsen 2004). The left-hand sides of the triangles defined by points representing coeval units are approximately vertical; in other words, the units whose points define the left-hand sides of the triangles are approximately the same size. On Fig. 43.3, the ratio of trapped melt to primocryst in the crystal mush decreases upwards along a vertical line. If the same pattern carries over to real-world data, then the amount of trapped melt, relative to primocryst amount, is smaller in the UBS units than in the MBS units.

The lack of independence in the estimates of the volumes of the units does not in itself invalidate these conclusions. The estimates of the relative volumes may be correct; we just have less confidence that they are. Because we are using the estimates only in a qualitative fashion, the chances that our conclusions are reasonable improve.

Contours of equal trapped melt to primocryst ratio have a positive slope on Fig. 43.3, which illustrates the pattern of points in the simulation model. If the pattern applies to the real world, the upper boundaries, with negative slopes, of the triangles representing the coeval LZa and UZa units (red and yellow triangles) cannot be parallel to contours of equal ratio. We infer, then, that for these two sets of coeval units, the ratio of trapped melt to primocryst amount was smaller in the MBS units than in the LS units.

It is unlikely that coeval units of the LS and the UBS would have the same ratios of trapped melt to crystals. Consequently, the lines drawn between points representing LS and UBS units are probably not lines of constant ratio (compare Fig. 43.3).

### **43.6 Pearce Element Ratios, Cumulate Rocks, and September 11**

Wager's discovery of the Skaergaard Intrusion and his recognition of its significance to igneous petrology and Pearce's insight that led to Pearce element ratios opened ways to decipher how cumulus rocks came to be. Understanding how these rocks came to be can affect our lives. They host ore deposits of chromium, nickel, and the platinum group elements (ruthenium, rhodium, palladium, osmium, iridium, and platinum), elements required by our civilization. To know more about how they came to be adds to our understanding of the Earth.

For nearly a decade I did little to extend the range of application of Pearce element ratios. An invitation to contribute to a review paper by *Geoscience Canada* led me to look at cumulate rocks through the lens of Pearce element ratios. Perhaps the perspective articulated by Stephen J. Gould in a piece he wrote for Canada's newspaper, the Toronto *Globe and Mail*, shortly after the events of 9/11 (Gould 2001) is apposite. His point: evil events, like 9/11, can cause big changes in our lives whereas many good events come in small packages. The good, however, by their number, eventually outweigh the evil. Maybe application of Pearce element ratios to the study of cumulate rocks can count as one of the small packages.

**Acknowledgements** Discussions with many people helped me learn about Pearce element ratios, in particular, Kelly Russell, Cliff Stanley, Terry Gordon, and Alex Wilson. Thanks to the late Tom Pearce for inventing Pearce element ratios.

### **Appendix: Computer Simulation**

The simulation will be for a single step or stage in the processes that lead to the development of a layered intrusion. The simulated system contains Si, Al, Mg, Ca, K, and P. Crystallization produces forsterite and anorthite with proportions of the two minerals constrained by the concentrations of Si, Al, Mg, and Ca in the melt. A fraction of the initial melt crystallizes to produce a melt modified in composition, some of which is trapped between the primocrysts.

The numbers that have to be specified to run the simulation include a composition for the initial melt (*im*[0], *im*[1], *im*[2], *im*[3], *im*[4], *im*[5]), where the items in the initial melt vector represent molar percentages of the elements Si, Al, Mg, Ca, K, and P. The size (*S*) of the melt in the simulated magma chamber is entered into the simulation procedure, as is the percentage (*P*) of the initial melt, or aliquot, that will supply the forsterite and anorthite crystals in the layer. The size is equal to the number of moles of the elements in the initial melt. The numbers of the different elements in the aliquot will be designated as (*aq*[0], *aq*[1], *aq*[2], *aq*[3], *aq*[4], *aq*[5]).

One could assume the simulated magma chamber was initially uniformly mixed and filled with a homogeneous melt. If the composition of the system is known, the simulation could be made deterministic to within two adjustable parameters if a thermodynamic component were included in the model. This is a consequence of Duhem's theorem (see Nicholls 1990a, 2000, 2013). One could also make it deterministic by extracting the maximum amounts of forsterite and anorthite from the aliquot. To add some variability into the simulation, we will sample the initial melt to create the aliquot by following a constrained random number procedure. *P* × *S*/100 random integers, *r<sub>n</sub>*, *n* = 1 … *P* × *S*/100, are generated from a uniform distribution between 0 and *P* × *S*/100.


The last two equalities ensure that the two conserved elements, K and P, enter the aliquot in the same proportions as they are found in the initial melt.

From this new melt, forsterite and anorthite crystallize. The amounts of the two phases that can be extracted from the new melt are constrained by the composition of the aliquot. The amount of anorthite that can be extracted depends on the numbers of Ca and Al elements in the melt:

if *aq*[3] < *aq*[1]/2, An = *aq*[3]; else An = *aq*[1]/2

The amount of forsterite depends on the number of Mg elements in the melt:

Fo = *aq*[2]/2

Using the amounts of anorthite and forsterite extracted from the aliquot, the numbers of elements in a new melt (*nm*[0], *nm*[1], *nm*[2], *nm*[3], *nm*[4], *nm*[5]) are calculated by:

*nm*[0] = *im*[0] − Fo − 2 An

*nm*[1] = *im*[1] − 2 An

*nm*[2] = *im*[2] − 2 Fo

*nm*[3] = *im*[3] − An

*nm*[4] = *im*[4]

*nm*[5] = *im*[5]

A melt with the new composition is then trapped between crystals to form the crystal mush. Solidification of the mush produces a layer in the cumulate rock body.
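The crystallization step of the appendix can be sketched in Python as below. This is a minimal illustration, not the original simulation program: the function name and all numbers are hypothetical, the random sampling that creates the aliquot is omitted (the aliquot is simply supplied), and the initial melt is assumed to be expressed as element counts in the index order 0 Si, 1 Al, 2 Mg, 3 Ca, 4 K, 5 P.

```python
def crystallize(initial_melt, aliquot):
    """Extract anorthite and forsterite and return the new melt.

    Index order: 0 Si, 1 Al, 2 Mg, 3 Ca, 4 K, 5 P.
    Returns (An, Fo, new_melt).
    """
    # Anorthite (CaAl2Si2O8) is limited by Ca or by half the Al.
    an = min(aliquot[3], aliquot[1] / 2)
    # Forsterite (Mg2SiO4) is limited by half the Mg.
    fo = aliquot[2] / 2
    new_melt = [
        initial_melt[0] - fo - 2 * an,  # Si: one per Fo, two per An
        initial_melt[1] - 2 * an,       # Al: two per An
        initial_melt[2] - 2 * fo,       # Mg: two per Fo
        initial_melt[3] - an,           # Ca: one per An
        initial_melt[4],                # K is conserved
        initial_melt[5],                # P is conserved
    ]
    return an, fo, new_melt


# Hypothetical element counts (not taken from the chapter).
im = [500, 150, 200, 100, 30, 20]
aq = [50, 16, 20, 9, 3, 2]
an, fo, nm = crystallize(im, aq)
```

In this example Al is the limiting element for anorthite (16/2 = 8 is less than the 9 Ca available), and K and P pass unchanged into the new melt, which is what makes them usable as conserved denominators in Pearce element ratios.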

### **References**


**Open Access** This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

### **Chapter 44 Reflections on the Name of IAMG and of the Journal**

**Donald E. Myers**

**Abstract** This note is to highlight the transformation of the names of *International Association for Mathematical Geologists* and its flagship journal *Mathematical Geology* respectively into *International Association for Mathematical Geoscientists* and *Mathematical Geosciences*.

When first approached about submitting something for the special volume I thought the idea was a good one but was not sure what I might have to say that would be relevant and of interest. Initially I planned to simply reflect on my year as Distinguished Lecturer (2008) but somehow it didn't seem sufficient. Instead I want to reflect on three words in the name of the organization and also on the current title of the journal, i.e. *International*, *Association*, *Mathematical*, *Geologists* and *Geosciences*. As anyone familiar with IAMG knows, it was born in Prague in 1968 in the midst of what turned out to be a momentous event, but it also returned to Prague to celebrate its 25th anniversary in 1993. I wasn't one of that moderately small but very influential group but I subsequently knew or still know many of them. I didn't really start working in the field until the early 1970s.

Prior to the 1970s I was only a mathematician but accidentally came in contact with two other faculty at the University of Arizona, Y. C. Kim (Mining Engineering) and De Verle Harris (Mineral Economics), as well as Art Warrick (Soils, Water and Engineering). Hence I was beginning to "Associate". Through them I learned about G. Matheron's work, met Frits Agterberg, André Journel and Shlomo Neuman (Hydrology), developed some collaboration with USGS in Denver and made plans to spend a sabbatical at the Centre de Géostatistique (Fontainebleau) in the spring of 1981. Ghislain de Marsily spent the academic year 1979–1980 at the University of Arizona in the Department of Hydrology. Through Art Warrick I knew of the work of Richard Webster. I was fortunate to be invited to participate in the NATO ASI at Lake Tahoe in 1983 and met many of the others in the very important group in mathematical geosciences.

D. E. Myers (✉)

Department of Mathematics, University of Arizona, Tucson, AZ 85721, USA e-mail: myers@math.arizona.edu

© The Author(s) 2018

B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_44

At this point it is important to note the changes that have taken place in the name of the journal. Initially most of the membership would have been geologists or mining engineers, but clearly hydrology and soil science are a part of the geosciences, so the interests and membership were expanding in scope. In fairly short order geosciences grew to encompass "environmental sciences", "geography", "ecology", "image analysis", "remote sensing", "epidemiology" and "atmospheric sciences" because the stress was on "geo" and not on "ology". Papers in the various soil science journals cited papers in the IAMG journal (and conversely), papers in the various American Geophysical Union journals cited papers in the IAMG journal (and conversely), and of course the petroleum industry was involved early through the collaboration between Fontainebleau and Shell Oil. It is likely that a list of referees for *Mathematical Geosciences* (and all the previous titles) would cross an ever-increasing list of countries and institutions as well as areas of interest.

Except perhaps in France the work of G. Matheron was not really known in the mathematical/statistics community even though his signal paper appeared in the *J. of Applied Probability* in 1973. *Mathematical Reviews* still doesn't really have a category for mathematical geosciences other than geophysics. The statistics community likewise was slow to recognize mathematical geosciences. Most of the interest in Radial Basis functions either relates to solutions for partial differential equations or approximation theory.

The various editors (and publishers) of *Mathematical Geosciences* have been very interested in the impact ratings of the journal, but it would be even more interesting to tabulate the number of different journals not closely related to mathematical geology that publish papers citing papers appearing in *Mathematical Geosciences* (including those that might have appeared twenty or thirty years ago). In many fields of science it is not uncommon for the significance or usefulness of a paper to appear many years later. This is especially true of pure mathematics.

As I have tried to point out, geosciences is a more encompassing term than geology (many university departments have changed their names to reflect this); the "mathematical" part of mathematical geosciences has also grown and expanded. In some ways statistics is an outgrowth of mathematics, but it is also an outgrowth of agriculture (think of the work at Rothamsted Experimental Station and the many land grant universities in the US) as well as of the social sciences and economics/business. Statistics by its very nature is a very cross-disciplinary applied area of interest. Another part of "mathematical" pertains to computing. The VAX computer and the software package BluePack were very much a part of the real growth of geostatistics; the desktop computer has created an even greater explosion. I first started teaching a class on geostatistics in 1982 and my students had to use a mainframe CDC 6400 with punch card input. It was terribly inconvenient, but without that access the class would have had no practical value. The advances in computing and in access to computing have revolutionized the teaching of statistics in all its various forms.

Clearly IAMG was international from its original founding and that perspective has only grown with time. I can speak to that from a personal perspective, not only from my experience as the Distinguished Lecturer in 2008 but also as a referee/reviewer for the journal and from attendance at various international meetings. I would also note the level of interest evident in the Questions appearing on the ResearchGate.net forum. It is truly international.

Sometimes old ideas come back in a different form. The Design of Experiments originated in applications to agriculture and often emphasized various forms of "plot design" but now it may be important in the design of aircraft wings and may incorporate kriging and/or cokriging. Google tells me that my paper on cokriging (J. of the International Assn of Mathematical Geologists, 1982) is being cited for applications very far afield from the problem I thought I was addressing when I wrote the paper. I am sure other authors of papers that appeared in this journal may have had a similar experience. It is a tribute to the vision of the founders of IAMG back in 1968. "Mining Geostatistics" was a classic when it appeared (the English version) and I am sure that many readers had no interest in mining but there were ideas and concepts in it that were useful for other kinds of problems. The proceedings of the NATO ASI (*Advanced Geostatistics in the Mining Industry*) became "Geostatistics for Natural Resources Characterization" in 1984. Who knows what the future will bring but IAMG and *Mathematical Geosciences* have made a significant contribution. They have influenced the development of mathematics, statistics, computing as well as the various fields that might be grouped under the heading "GEO-sciences".


### **Chapter 45 Origin and Early Development of the IAMG**

**Frits Agterberg**

**Abstract** This chapter is primarily concerned with the first 15 years of our existence (I was a member of the IAMG Founding Committee, and on the 1968–1972 and 1976–1980 IAMG Councils). Daniel Merriam and Richard Reyment are the principal fathers of the IAMG, and many other scientists have contributed significantly to its origin and early development. Personal contacts with them are briefly described. These comments are supplementary to those already provided in earlier chapters by Founding Members and others who made significant early contributions to the IAMG. Special attention is paid to inputs by prominent mathematical statisticians with an interest in geology. I am grateful to all pioneers who have helped to establish the IAMG and provided a climate encouraging younger scientists, including myself, to pursue careers in their field of interest.

**Keywords** IAMG history ⋅ Richard Reyment ⋅ Daniel Merriam ⋅ Early mathematical geologists

### **45.1 Introduction**

Perspectives on the origin and early development of the IAMG have already been provided in earlier chapters. Most of the following remarks are complementary to these other reminiscences. They are based on documents in the IAMG Archive, private information and what is publicly available on the IAMG Website including Newsletters from 1970 onward.

Richard Reyment had the original vision of establishing our organization as the offspring of two parents: the International Union of Geological Sciences and the International Statistical Institute. As a successful example for geologists to follow, he took the biometrical society, which was already in existence for quantitative biologists and other life scientists, with its strong component of mathematical statistics. During 1966 and 1967, Reyment sought international support for the formation of our society. Mathematical statisticians were especially supportive of his idea. He then organized the Founding Committee of the IAMG, although our name was to be chosen later. He invited me to be a member of his committee and chaired our inaugural meeting during the 23rd IGC in Prague, where he became the IAMG's first Secretary General.

F. Agterberg (✉)

Geological Survey of Canada, 601 Booth Street, Ottawa, ON K1A 0E8, Canada e-mail: frits.agterberg@canada.ca; frits@rogers.com

© The Author(s) 2018 B. S. Daya Sagar et al. (eds.), *Handbook of Mathematical Geosciences*, https://doi.org/10.1007/978-3-319-78999-6\_45

Daniel Merriam provided us with the essential publication and organizational background support for more than 30 years. In 1969 Dan was the founding Editor-in-Chief of the *Journal of the International Association for Mathematical Geology* (currently: *Mathematical Geosciences*), and in 1975 of *Computers & Geosciences*. Additionally, he was the chief organizer of numerous international meetings in our field, and editor of the proceedings for these meetings, as well as several other multi-author books. Later, in 2001, he took over as Editor-in-Chief of *Natural Resources Research*, our third international scientific journal that had originally been founded by Dick McCammon in 1992 under the name *Non*-*Renewable Resources*. In 1966, as Head of the Mathematical Geology Section, Kansas Geological Survey, Dan established the Distinguished Visiting Research Scientists program inviting mathematical geologists to work with him and his colleagues for successive one-year periods in Lawrence, Kansas. I was happy to accept Dan's invitation to occupy this position in 1969/70. During this fruitful year, my family and I were housed in the Sunflower apartments on the campus of Kansas University and received great hospitality. Merriam left Lawrence in 1976 to become Chair of the Geology Department, Syracuse University, where he commenced a new school for quantitative geoscientists. John Davis succeeded him at the Kansas Geological Survey.

Although originally educated in classical geology and geophysics at the University of Utrecht, I developed an interest in probability and statistics as a graduate student and published some papers on statistics applied in geology. Because of this, in 1962 I was invited to become "petrological statistician" at the Geological Survey of Canada (GSC) in Ottawa, initially to work within the framework of the Canadian Contribution to the International Upper Mantle Project and later to form their Geomathematics Section. The word "geomathematics" was used in analogy with "geophysics" and "geochemistry", but as a term it was never widely accepted. In 1982, engineers in photogrammetry had the idea of abbreviating the same word to "geomatics", which became widely accepted as a new discipline but is quite different from "mathematical geosciences".

GSC management allowed me to participate in the inaugural IAMG meeting on August 22nd, 1968, during the 23rd International Geological Congress in Prague. As described in earlier chapters, this event was disrupted and aborted because of the Russian-led occupation of Czechoslovakia. A list of participants in the inaugural meeting was included in its Minutes (see Appendix for final version of Minutes copied from the IAMG Archive) but several mathematical geologists including Bill Krumbein and Graeme Bonham-Carter, who had been planning to come to our first meeting, were prevented from coming to Prague to participate in the event. Fortunately, my hotel was within walking distance of the Congress Centre and I also had been able to see several Founding Members before our meeting. Soon afterwards I was forced to leave Prague by car in a convoy of Dutch nationals led by the Dutch ambassador in the first car. Reyment had asked me to prepare minutes for our inaugural meeting and I handed him my first draft in Amsterdam where he, Geof Watson and I presented review papers at the Geostatistics Session organized during the 1968 meeting of the International Association of Statistics in the Physical Sciences (Section of the International Statistical Institute). This event helped to consolidate our affiliation with ISI. Formal affiliation with the IUGS had already been achieved in Prague.

### **45.2 Pioneers of Mathematical Geology**

At its annual meetings the IAMG continues to honor the five most eminent pioneering scientists in our field: William Christian Krumbein, Andrey Borisovich Vistelius, John Cedric Griffiths, Felix Chayes and Georges Matheron. I was fortunate to know all five of them. Other leading scientists with strong IAMG involvements included John Tukey, Geof Watson, Danie Krige, Tim Whitten, Jean Serra and Walther Schwarzacher. Merriam and Howarth (2004) arranged for the publication of biographical articles on Matheron, Griffiths, Chayes, Reyment, Krumbein and Vistelius in a special edition of *Earth Sciences History*.

Krumbein (1936, 1939) was already developing important statistical techniques for geologists in the 1930s. My initial contact with him took place in the fall of 1961 when I was a postdoctoral fellow at the University of Wisconsin in Madison. My first assignment there was to perform statistical analysis of thousands of measurements on directional features taken by Ph.D. student Garrett Briggs in the Arkoma Basin of east-central Oklahoma (Agterberg and Briggs 1963). My report was reviewed by Krumbein before publication. His helpful comments included the suggestion to expand what initially was a brief footnote into a full section. It said that the circular normal (von Mises) distribution for vectorial data converges to normal (Gaussian) form when dispersion around the vector mean approaches zero, so that standard (non-directional) statistical techniques including analysis of variance remain approximately applicable. Krumbein said that this remark solved a long-standing problem for him. Later, two of his Ph.D. students working with orientation data made use of this approach, publishing their results in the first issue of our first IAMG journal (Jones and James 1969). I did not know at the time that Watson (1960) had already developed better approximations for statistical analysis of directional data. During his career, Krumbein continually sought the advice of mathematical statisticians, including Franklin Graybill and John Tukey, in order to stay on the right track. In 1963 the GSC invited him to Ottawa as a consultant, and I visited him at Northwestern University in a follow-up visit. Later I saw him regularly at scientific meetings, especially at those organized by Merriam in Lawrence, Kansas.
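The convergence mentioned above can be sketched with the standard form of the von Mises density. The notation is an assumption introduced here, not that of the original report: *μ* is the mean direction, *κ* the concentration parameter and *I*<sub>0</sub> the modified Bessel function of order zero.

$$f(\theta \mid \mu, \kappa) = \frac{\exp\left[\kappa \cos(\theta - \mu)\right]}{2\pi I_0(\kappa)}$$

When dispersion is small, *κ* is large and the density is appreciable only near *μ*, where $\cos(\theta - \mu) \approx 1 - (\theta - \mu)^2/2$. Together with the large-*κ* approximation $I_0(\kappa) \approx e^{\kappa}/\sqrt{2\pi\kappa}$, this gives

$$f(\theta \mid \mu, \kappa) \approx \sqrt{\frac{\kappa}{2\pi}} \exp\left[-\frac{\kappa (\theta - \mu)^2}{2}\right]$$

which is a normal density with mean *μ* and variance 1/*κ*.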

As a graduate student I gave an economic geology seminar on the skew frequency distribution of ore assays. In preparation I had read Krige's MSc thesis on microfilm in the library of the University of Utrecht. The published version of my seminar (Agterberg 1961) drew the attention of Danie Krige, who wrote to me about it and became a good friend and esteemed colleague for more than 50 years. In 1963 he came to Ottawa on his way to the 3rd APCOM Symposium held at Stanford University. APCOM stands for "Applications of Computers and Operations Research in the Mineral Industries". With his wife Ansie and a colleague we went to Niagara Falls on a sightseeing trip. Danie persuaded GSC management that I should attend the 4th APCOM to be hosted by the Colorado School of Mines in 1964. Originally, APCOM meetings provided an important forum for mathematical geologists. I first met Dan Merriam, John Harbaugh, Tim Whitten and many others at early APCOMs.

In 1965 the GSC allowed me two months of travel abroad provided that I paid for my own travel expenses. First I went to the Netherlands where Codien Zwaardemaker invited me to dinner (we got married later that year; from 1993 onward she accompanied me to all IAMG annual meetings except one). From Amsterdam I went on to visit Krige in Johannesburg, who took his family and me to the Kruger Park. Next there was the 8th Commonwealth Mining Congress in Australia, and finally the 5th APCOM at the University of Arizona, where I presented statistical analysis results for chemical analyses from the Muskox Layered Intrusion in northern Canada that was considered to be a sample of the upper mantle (Agterberg 1965). After this presentation John Griffiths came forward to congratulate me, also inviting me to present two papers instead of one at the next (1966) APCOM he would be hosting at the Pennsylvania State University. In those days, politicians paid more public attention to oil and ore than they do today. The U.S. Secretary then in charge of mineral resources and mining gave the post-Symposium dinner speech. One of my two papers (Agterberg 1966) was entitled "Markov schemes for multivariate well data" and the Secretary singled this one out for a Cold War joke. Griffiths became one of my principal mentors. In 1968 Elsevier invited me to write a geomathematical textbook (Agterberg 1974). Griffiths and Merriam read all chapters and offered numerous helpful comments. Later I was honored to be invited to write the first chapter in the Griffiths commemorative book "Future Trends in Geomathematics" (Craig and Labovitz 1981).

Andrey Vistelius was the first IAMG President, and the name of his Laboratory of Mathematical Geology was adopted for the IAMG. Tim Whitten, who was with Krumbein at Northwestern University, Evanston, Illinois, had invited him to come to North America in 1975, and for the last two weeks of this visit he was in the Geomathematics Section at the GSC in Ottawa. Before arrival, Vistelius had expressed the desire to sample a Canadian granite intrusion, preferably one with associated tin mineralization. There exists such a granite body in Nova Scotia but logistically we could not mount an expedition to sample it. Instead, with the help of other geologists we sampled the Meach Lake aplite body close to Ottawa. Aplite is fine-grained granite and this turned out to be a practical advantage, because thin sections of rock samples that could be cut in Ottawa were much smaller than the very large thin sections Vistelius had produced in Leningrad for counting frequencies of transitions between different minerals in granites. In total 104 thin sections were transition-counted and statistically analyzed. The rock body was interpreted to be "ideal granite" in which sequences of mineral grains are Markov chains (Vistelius et al. 1983). Later Xu et al. (2007) provided an alternative multifractal explanation of the Meach Lake aplite textures.

While Vistelius was in Ottawa, a preliminary itinerary was set up for my 6-week visit to the Soviet Union that took place two years later. It commenced with a 10-day stay in Novosibirsk where I participated in the Siberian Seminar on "Application of Mathematical Methods and Computers for Mineral Search and Prospecting" organized by Yuri Voronin. Václav Němec, IAMG Treasurer (East), was participating as well. Neither Vistelius nor Founding Member Dmitry Rodionov attended. Němec was our IAMG ambassador to the Soviet bloc countries (cf. Agterberg 1994). My Siberian Seminar contribution (Agterberg 1977) was the only presentation with slides. Initially, the organizers told me that I could only show three slides, because other participants were not allowed to display more than three posters, but they relented. A slide projector was brought in from another institute and all my slides were shown. Before I left for Moscow, the next stop, Němec warned me that during my upcoming visit to Rodionov and his colleagues I would be asked for an opinion on the work of Voronin and his team; he explained that a negative opinion could be detrimental because Moscow controlled funding of the Novosibirsk projects. I was careful in what I said. It was understood in the Soviet Union that the farther east you went, the more philosophical the mathematical approach to geology became. I learned at the Siberian Seminar that rocks are subject to the basic philosophical principle that the "whole is more than the sum of the parts".

The last two weeks of my visit to the Soviet Union were spent in Leningrad. Every day I arrived at the Laboratory of Mathematical Geology 2 h before Vistelius, who did most of his work at home where we went in the afternoon for discussions and a meal. As explained by Steve Henley, Vistelius was given a hard time under the communist regime because of his aristocratic roots. In order to accept an invitation for a lecture tour he had just received from Japan, he needed numerous approvals. The process, which involved various unpleasant interviews with officials plus extensive form-filling, took more than two weeks. On the day of my departure Vistelius received a phone call from somebody he referred to as a "foxtail" who communicated indirectly to him what could be interpreted as final travel approval. The foxtail did not communicate this in so many words but said that an official in Moscow had remarked that the Laboratory of Mathematical Geology in Leningrad did good work. This implied approval and Vistelius did indeed go to Japan shortly afterwards. During our many discussions we were not always in total agreement. Vistelius held very strong opinions and was not at all impressed by geostatistics or geostatisticians. He felt that mathematical geology had to be "pure" and not contaminated with economic motivations. Even much later, after he had invited me to participate in a mathematical geology meeting, he pointed out that in his session there would be no room for statistics applied to ore deposits, but he suggested other topics on which I could report.

My recognition of the validity of French geostatistics took place in 1964 because of a curious incident. Our library had obtained a copy of the first book by Matheron (1963) but there had been a complaint from the public that this volume contained absolute nonsense and should be removed from the shelf. The head of the Library Committee approached me and asked for an evaluation because: "We don't want bad books on our shelves". My report was favorable and the book could stay. Although this is not universally known, Georges Matheron commenced his career at the French Geological Survey (BRGM) in 1954. One of his first publications (Matheron 1955a) concerns the Gara Djebilet oolitic iron deposit in Algeria. It is a standard geological publication with detailed descriptions of the stratigraphy, structure and genesis of this deposit of Early Devonian age plus a folded geological map in the back. It seems that Matheron started out as a classical geologist but shortly afterwards he published a paper (Matheron 1955b) on applications of statistical methods for ore reserve estimation. This first paper foreshadowed the revolutionary approach to spatial statistics he was to bring about during the last 40 years of the 20th century. Like Vistelius, Matheron had strong opinions on topics that would be suitable for research. His first two Ph.D. students (Michel David and André Journel) ran into significant problems later on, when in some of their projects they deviated from what Matheron felt was appropriate for them. In 1968 Michel David had come to the École Polytechnique in Montreal and we collaborated on several projects. One of these involved correspondence analysis (Agterberg and David 1979). But one day David showed me a letter from Matheron stating that this work should be stopped immediately and that he should return to working full-time on geostatistics.

In 1968 Georges Matheron established the Centre de Morphologie Mathématique in Fontainebleau, as a research institute of the École des Mines de Paris. Jean Serra was his close collaborator. Matheron's preferred mode of work was to be in his office in Fontainebleau during the day. He would document his findings in limited-edition geostatistical notes. Fully concentrating on his research, he liked neither speaking English nor traveling extensively. I visited him three times. Although for about 10 years my position at the GSC was classified as "bilingual", I never spoke French in Ottawa because all French Canadian colleagues spoke English. However, speaking French was a requirement for personal (and telephone) contact with Matheron. An extra benefit of making the geostatistical pilgrimage to Fontainebleau was that I could consult the numerous geostatistical notes in their library and could bring back to Ottawa any copies of particular interest. Today all these notes are freely available on a website maintained by the École des Mines de Paris. I am sure they continue to contain valuable information that is relatively unknown. During the late 1970s I programmed in FORTRAN some of the methods developed by Matheron and Serra. Twice, I received a *Computers & Geosciences* best-paper award for these efforts. I was pleased to be asked in 1975 to chair a session at the first Geostatistical World Conference held in Frascati, Italy, at which Georges Matheron presented a philosophical paper (Matheron 1976). At the 53rd Session of the International Statistical Institute in Seoul, August 2001, Georges Matheron was honored as one of the greatest mathematical statisticians during the second half of the 20th century (cf. Baddeley 2001). After obtaining approval from Mrs. Matheron, the IAMG established its annual Georges Matheron lecture in 2005, delivered for the first time by Jean Serra at IAMG2006 in Liège. Our Matheron Lecture was modeled after the Fisher Memorial Lecture initiated by the International Statistical Institute in 1966.

Felix Chayes was a member of the IAMG Founding Committee and participated in many IAMG events. His numerous contributions have been documented by Howarth (2004). Upon his death in 1993 he left the IAMG a significant legacy in order to fund the biennial Felix Chayes Prize for Excellence in Research in Mathematical Petrology. For many years Chayes was involved in compiling large databases with worldwide data on Cenozoic volcanic rocks. This effort included directing International Geological Correlation Programme (IGCP) Project 163 (1977–1984) IGBA (Igneous data Base) which had supportive software as well. Close IAMG involvement with IGCP had been promoted by Merriam who also helped initiate IGCP Project 148 (1976–1983) "Quantitative Stratigraphy".

John Cubitt was the original leader of IGCP Project 148, but in 1977 he left Syracuse University, where he was with Merriam, to become a private consultant in the U.K., and I took over from him. We created a group of lecturers to present one-week short courses on the subject that eventually were held in as many as nine different countries. The strategy was to attract staff from oil companies in "developed" countries willing to pay registration fees that were later used to give the course in "developing" nations. Walther Schwarzacher and I were part of this "traveling circus". Originally, I had met Schwarzacher in Lawrence, Kansas, where we were both associated with Merriam's quantitative geology group. He was the IAMG's second Krumbein Medallist in 1977 (John Griffiths was the first a year earlier). In the IGCP Project 148 short course Schwarzacher lectured on lithostratigraphic correlation. Later he published a book that explained the Milankovitch theory (Schwarzacher 1993) according to which very small periodic variations in solar radiation create major climate changes on Earth. This idea had been anticipated by Croll (1875) as an explanation of the ice ages. Currently, the entire post-Cretaceous international geologic time scale is based on Milankovitch theory.

Walther and I had several things in common. In Europe we had attended similar high schools called "gymnasium" in both Austria and the Netherlands, at which the emphasis was on Latin and Greek. We still could recite some of the Odyssey to each other. Later I tried some of my ancient Greek on Roussos Dimitrakopoulos who smiled benevolently. The supervisor of Schwarzacher's Ph.D. project had been Bruno Sander at the University of Innsbruck. Later (in 1957) I took a short course at this university in order to learn micro-tectonics in preparation for my fieldwork during four successive summers in northern Italy (Agterberg 1961). The most important results of this doctoral thesis were included in Whitten's (1966) textbook on structural geology. Later, Hannes Thiergärtner and Heinz Burger invited me to contribute further articles on this subject on two occasions. Original Alpine deformation patterns for the basement of the Italian Dolomites had to be re-interpreted in terms of rapid movements of the Adria microplate that presently keep on creating earthquakes in the Apennines (cf. Agterberg 2014).

### **45.3 Inputs from Mathematical Statisticians**

Most important among the first mathematical statisticians was Ronald Fisher (1954), who suggested that geology with Lyell (1833) had been evolving as a more quantitative science. However, opposition against this development grew rapidly, to the extent that Lyell's elaborate tables and statistical arguments (60 pages long) for his subdivision of the Tertiary were omitted from later editions of his *Principles of Geology*. In 1952 Fisher commenced giving regular talks on continental drift (cf. Fisher Box 1978, p. 440), lamenting that geophysicists and geologists were failing to take seriously Alfred Wegener's ideas on continental drift proposed in 1912. Plate tectonics only became generally accepted as a theory in the mid-1960s.

My Moscow stay in 1977 would have included visiting Andrey Nikolayevich Kolmogorov (1956), who originally formulated the axioms of probability calculus in his famous monograph of 1933. Unfortunately, this visit had to be canceled for medical reasons. Like Krumbein in North America, Vistelius regularly consulted with mathematical statisticians and Kolmogorov was a major source of inspiration to him.

In 1983 the traveling circus of IGCP Project 148 was at the Indian Institute of Technology in Kharagpur. The lecturers included Geof Watson, 1968–1972 IAMG Vice President, who within 2 h filled an extra wide blackboard entirely with equations on the relationship between kriging and interpolation splines. It is doubtful that anybody in the audience (including me) could understand what he was talking about. Later I spent significant time understanding his subsequent paper on the subject (Watson 1984). I used smoothing splines extensively for estimating the ages of stage boundaries (with 95% error bars) in the International Geological Time Scale (Gradstein et al. 2004). Watson has done much to make Matheron's work in the fields of geostatistics and mathematical morphology better known in the English-speaking world. He persuaded Matheron (1975) to write his book on random sets and integral geometry. At the time Watson told me that there would be only three people in the world able to understand this book from beginning to end.

Watson (1960) had developed statistical methods for directional features that were similar to methods for ordinary data originally developed by Fisher, who was the world's most outstanding mathematical statistician during the first half of the 20th century. Fisher was from before my time. Some of our earliest IAMG members including Griffiths and Schwarzacher knew him personally. When I attended the 1963 congress of the International Statistical Institute in Ottawa, he had already left for Adelaide, Australia where he spent his last years in retirement. Fisher's life is described in detail by his daughter Joan Fisher Box (1978). During the latter part of the 19th century, Karl Pearson had introduced many basic statistical concepts including the Pearson correlation coefficient and goodness-of-fit tests for contingency tables, basing his approach on normal (Gaussian) distribution models. Fisher derived the mathematical equation for the frequency distribution of the Pearson correlation coefficient and introduced numbers of degrees of freedom for various statistical methods that became widely used, also by the early mathematical geologists. In these methods extensive use was made of independent identically distributed (*iid*) random variables, contrary to geostatistical applications in which the emphasis was on "regionalized" variables that generate observed values that are not stochastically independent but spatially correlated.

In 1966 the GSC allowed me to participate in the Advanced Statistical Seminar at the University of Wisconsin organized by Fisher's son-in-law Box. During the Icebreaker I was introduced to John Tukey who told me about his interest in geology. At this seminar he presented "The Fast Fourier Transform, for fun and profit" (cf. Cooley and Tukey 1965). Back in Ottawa, I received a box filled with about 2000 IBM cards for running the FFT in 1, 2, or 3 dimensions on our mainframe computer. During the next 25 years, Tukey commented on my projects at the GSC in three of the approximately 800 publications he authored or co-authored (cf. Agterberg 2001; Tukey 1984). Like Matheron, he was recognized at the 2001 ISI Congress in Seoul as one of the greatest mathematical statisticians alive during the second half of the 20th century. With Watson, who had become Chair of the Princeton University Statistics Department, where Tukey was a professor, he attended the 1969 Geostatistics Colloquium organized by Dan Merriam in Lawrence, Kansas, that also had Matheron, Krumbein and Serra as participants.

Watson owned a cottage on Blood Hill near Elizabethville in the Adirondacks, New York State, not too far from Ottawa. In those days, the GSC maintained a pool of cars with the words "Geological Survey of Canada" in big letters on the sides. I could use one of these cars to visit Watson during weekends. Once I drove Geof and some of his family members to Princeton where Tukey spotted us on the campus. He started laughing and pointing his finger at Watson, suggesting that Geof had become a "geologist". Watson stimulated me to improve my mathematical skills. Pointing out some errors in a review of Agterberg (1974) he had, somewhat sarcastically, remarked that one could see I was not trained as a mathematical statistician. However, he would have granted me an MSc degree in this discipline. Subsequently I worked hard on my mathematics. In 1983 I organized a geomathematical workshop at the GSC in Ottawa with Geof Watson, Jean Serra and Benoit Mandelbrot among the presenters. Mandelbrot, who had coined the word "fractal", had, like Matheron, been a student of Paul Lévy at the École Polytechnique in Paris. Other participants in our workshop included the directors of Carleton University's Centre of Mathematical Statistics who shortly afterwards invited me to become an Adjunct Professor in their Mathematics Department. I felt this was almost as good as a Ph.D. in mathematical statistics. Personally, I have always felt that this discipline offered me more challenges than conventional geology although this remains a scientific discipline in its own right.

### **45.4 Concluding Remarks**

The preceding remarks are to a large extent personal, like several reminiscences in earlier chapters. I have tried to add to these other contributions, above all attempting to bring out the generosity our pioneers extended to younger colleagues. By their research and contributions to the IAMG they ensured a healthy organization that should continue to exist and expand for many years to come.

### **Appendix: Minutes of the First Meeting of the International Association for Mathematical Geology, Prague, August 22, 1968**

The meeting was attended by 20 representatives from 10 different countries (see attached list of participants).

After a general introduction by the acting chairman, R. A. Reyment, the following two problems were discussed:


The relatively short name of "International Association for Mathematical Geology (I.A.M.G.)" was adopted for the Society.

A. B. Vistelius proposed discussion of possible classes of membership and also which categories of members should be entitled to vote in the General Assembly. It was pointed out that the Association should consider the options of (a) voting by country (each country one vote) or (b) as individual scientists. However, membership should be open to all scientists. The possibility of having a fixed number of voting members was also discussed. It was felt that the latter procedure may be unfair to the larger countries.

Article 7 of the proposed Statutes (each member of I.A.M.G. one vote) was adopted. However, this discussion resulted in the following change in Article 10 of the proposed statutes:


The following by-law was adopted:

"By-law 7: Not more than two ordinary members, and/or four members of the Council shall be from the same country. This by-law shall be reviewed every four years by the General Assembly."

The matter of introducing a journal was discussed. First, the following by-law was accepted:

"By-law 8: The editor-in-chief, in consultation with the Council, shall be empowered to appoint up to four associate editors."

The Assembly adopted a motion initiated by G. S. Watson "that the Society shall have a journal".

After acceptance of the statutes and by-laws had been reached, along with general agreement that there shall be a journal, the chairman proposed to the Assembly the election of the officers of the Council.

The following 13 members of the Council were elected:

A. B. Vistelius—President

G. S. Watson—Vice President (also president elect)

R. A. Reyment—Secretary General

V. Němec—Treasurer (east)

T. V. Loudon—Treasurer (west)

W. C. Krumbein—Past President (instead of Immediate Vice President, see by-law 9)

D. F. Merriam—Editor-in-Chief

D. A. Rodionov, S. P. Sen Gupta, F. P. Agterberg, G. Matheron, D. G. Krige, E. H. T. Whitten—Ordinary members.

The following by-law was accepted:

"By-law 9: For the first four years of the Society's life, instead of an immediate past president, there shall be an additional vice president."

Since some of the elected members were not present at this meeting, the following motion initiated by J. W. Harbaugh, was adopted:

"If an elected member should not wish to serve on the council, Professor Vistelius shall nominate the next member on the list." Prof. Vistelius has a list of persons eligible as ordinary members and the number of votes they received at the election.

P. Wilkinson moved that: "The Association encourages, in principle, the formation of national groups in mathematical geology and that the question of affiliation should be discussed at the next General Assembly in Montreal." This motion was adopted.

Finally, the policy and objectives for the journal were discussed. It was suggested that there should be a broad editorial program, similar to that of the biometrical journal *Biometrics*. The editor-in-chief should prepare guidelines for the journal. The first issues should also contain educational papers.

The official languages of the organization are French, English, German and Russian. It is appreciated that the editing of papers in Russian may present a problem to the editor-in-chief, and in practice only two or three languages will be used for publication. All articles shall have an abstract in English.

List of participants, First meeting of International Association for Mathematical Geology, Prague, August 22, 1968.

R. A. Reyment (Sweden) D. A. Rodionov (U.S.S.R.) A. B. Vistelius (U.S.S.R.) F. P. Agterberg (Canada) H. Knape (G.D.R.) H. Thiergärtner (G.D.R.) G. S. Watson (U.S.A.) V. Němec (Czechoslovakia) D. J. Burdon (FAO of United Nations) C. J. Dixon (U.K.) P. Wilkinson (U.K.) T. V. Loudon (U.K.) R. Ivanov (Bulgaria) V. Kutolin (U.S.S.R.) F. Benkö (Hungary) E. H. T. Whitten (U.S.A.) R. B. McCammon (U.S.A.) J. W. Harbaugh (U.S.A.) R. Hesse (F.R.G.) D. F. Merriam (U.S.A.)

### **References**


Agterberg FP (1974) Geomathematics. Elsevier, Amsterdam

